DIRECT AND ITERATIVE METHODS
FOR BLOCK TRIDIAGONAL LINEAR SYSTEMS
Don Eric Heller
April 1977
DEPARTMENT
of
COMPUTER SCIENCE
Carnegie-Mellon University
DISTRIBUTION STATEMENT A: Approved for public release; Distribution Unlimited
ABSTRACT
Block tridiagonal systems of linear equations occur frequently in
scientific computations, often forming the core of more complicated prob-
lems. Numerical methods for solution of such systems are studied with
emphasis on efficient methods for a vector computer. A convergence theory
for direct methods under conditions of block diagonal dominance is developed,
demonstrating stability, convergence and approximation properties of direct
methods. Block elimination (LU factorization) is linear, cyclic odd-even
reduction is quadratic, and higher-order methods exist. The odd-even
methods are variations of the quadratic Newton iteration for the inverse
matrix, and are the only quadratic methods within a certain reasonable
class of algorithms. Semi-direct methods based on the quadratic convergence
of odd-even reduction prove useful in combination with linear iterations
for an approximate solution. An execution time analysis for a pipeline
computer is given, with attention to storage requirements and the
effect of machine constraints on vector operations.
PREFACE
It is with many thanks that I acknowledge Professor J. F. Traub, for
my introduction to numerical mathematics and parallel computation, and for
his support and encouragement. Enlightening conversations with G. J. Fix,
S. H. Fuller, H. T. Kung and D. K. Stevenson have contributed to some of
the ideas presented in the thesis. A number of early results were obtained
while in residence at ICASE , NASA Langley Research Center; J. M. Ortega and
R. G. Voigt were most helpful in clarifying these results. The thesis was
completed while at the Pennsylvania State University, and P. C. Fischer
must be thanked for his forbearance.
Of course, the most important contribution has been that of my wife,
Molly, who has put up with more than her fair share.
CONTENTS
page
1. Introduction 1
A. Summary of Main Results 5
B. Notation 8
2. Some Initial Remarks 10
A. Analytic Tools 10
B. Models of Parallel Computation 14
3. Linear Methods 18
A. The LU Factorization 18
1. Block Elimination 18
2. Further Properties of Block Elimination 25
3. Back Substitution 30
4. Special Techniques 31
B. Gauss-Jordan Elimination 33
C. Parallel Computation 36
4. Quadratic Methods 37
A. Odd-Even Elimination 37
B. Variants of Odd-Even Elimination 45
C. Odd-Even Reduction 48
1. The Basic Algorithm 48
2. Relation to Block Elimination 51
3. Back Substitution 56
4. Storage Requirements 58
5. Special Techniques 59
D. Parallel Computation 60
E. Summary 64
5. Unification: Iteration and Fill-In 65
6. Higher-Order Methods 71
A. p-Fold Reduction 71
B. Convergence Rates 75
C. Back Substitution 80
7. Semidirect Methods 83
A. Incomplete Elimination 84
B. Incomplete Reduction 86
C. Applicability of the Methods 90
8. Iterative Methods 92
A. Use of Elimination 92
B. Splittings 93
C. Multi-line Orderings 94
D. Spiral Orderings 97
E. Parallel Gauss 101
9. Applications 105
A. Curve Fitting 105
B. Finite Elements 106
10. Implementation of Algorithms 110
A. Choice of Algorithms (Generalities) 110
B. Storage Requirements 113
C. Comparative Timing Analysis 120
Appendix A. Vector Computer Instructions 131
Appendix B. Summary of Algorithms and Timings 136
References 156
1. INTRODUCTION
Block tridiagonal matrices are a special class of matrices which arise
in a variety of scientific and engineering computations, typically in the
numerical solution of differential equations. For now it is sufficient to
say that the matrix looks like

        | b_1  c_1                     |
        | a_2  b_2  c_2                |
    A = |      a_3  .    .             |
        |           .    .    c_{N-1}  |
        |                a_N  b_N      |

where b_i is an n_i x n_i matrix and a_i, c_i are dimensioned conformally. Thus
the full matrix A has dimension (n_1 + ... + n_N) x (n_1 + ... + n_N).
In this thesis we study numerical methods for solution of a block tri-
diagonal linear system Ax = v, with emphasis on efficient methods for a
vector-oriented parallel computer (e.g., CDC STAR-100, Illiac IV). Our
analysis primarily concerns numerical properties of the algorithms, with
discussion of their inherent parallelism and areas of application. For
this reason many of our results also apply to standard sequential algo-
rithms. A unifying analytic theme is that a wide variety of direct and
iterative methods can be viewed as special cases of a general matrix itera-
tion, and we make considerable use of a convergence theory for direct
methods. Algorithms are compared using execution time estimates for a
simplified model of the CDC STAR-100 and recommendations are made on this
basis.
There are two important subclasses of block tridiagonal matrices,
depending on whether the blocks are small and dense or large and sparse.
A computational method to solve the linear system Ax = v should take
into account the internal structure of the blocks in order to obtain
storage economy and a low operation count. Moreover, there are impor-
tant applications in which an approximate solution is satisfactory.
Such considerations often motivate iterative methods for large sparse
systems, especially when a direct method would be far more expensive.
Since 1965, however, there has been an increased interest in the
direct solution of special systems derived from two-dimensional elliptic
partial differential equations. With a standard five-point finite differ-
ence approximation on a rectangular grid, the a_j, c_j blocks are diagonal
and the b_j blocks are tridiagonal, so A possesses a very regular sparse
structure. By further specializing the differential equation more struc-
ture will be available. Indeed, the seminal paper for many of the impor-
tant developments dealt with the simplest possible elliptic equation. This
was Hockney's work [H5], based on an original suggestion by Golub, on the
use of Fourier transforms and the cyclic reduction algorithm for the solu-
tion of Poisson's equation on a rectangle. For an n x n square grid Hockney
was able to solve Ax = v in O(n^3) arithmetic operations, while ordinary
band-matrix methods required O(n^4) operations. Other O(n^3) methods for Poisson's
equation then extant had considerably larger asymptotic constants, and the
new method proved to be significantly less time-consuming in practice. Sub-
sequent discovery of the Fast Fourier Transform and Buneman's stable version
of cyclic reduction created fast (O(n^2 log n) operations) and accurate
methods that attracted much attention from applications programmers and
numerical analysts [B20], [D2], [H6]. The Buneman algorithm has since
been extended to Poisson's equation on certain nonrectangular regions
and to general separable elliptic equations on a rectangle [S6], and well-
tested Fortran subroutine packages are available (e.g., [S8]). Other
recent work on these problems includes the Fourier-Toeplitz methods [F2],
and Bank's generalized marching algorithms [B5]. The latter methods work
by controlling a fast unstable method, and great care must be taken to
maintain stability.
Despite the success of fast direct methods for specialized problems
there is still no completely satisfactory direct method that applies to a
general nonseparable elliptic equation. The best direct method available
for this problem is George's nested dissection [Gl], which is theoreti-
cally attractive but apparently hard to implement. Some interesting tech-
niques to improve this situation are discussed by Eisenstat, Schultz and
Sherman [El] and Rose and Whitten [R5] among others. Nevertheless, in
present practice the nonseparable case is still frequently solved by an
iteration, and excellent methods based on direct solution of the separable
case have appeared (e.g., [C4], [C5]). The multi-level adaptive techniques
recently analyzed and considerably developed by Brandt [B13] are a rather
different approach, successfully combining iteration and a sequence of dis-
cretization grids. Brandt discusses several methods suitable for parallel
computation and although his work has not been considered here, it will
undoubtedly play an important role in future studies. The state-of-the-art
for solution of partial differential equations on vector computers is sum-
marized by Ortega and Voigt [01].
References include [B19], [D1], [S5], [S7], [S9].
With the application area of partial differential equations in mind,
the major research goal of this thesis is a general analysis of direct
block elimination methods, regardless of the sparsity of the blocks. As
a practical measure, however, the direct methods are best restricted to
cases where the blocks are small enough to be stored explicitly as full
matrices, or special enough to be handled by some storage-efficient im-
plicit representation. Proposed iterative methods for nonseparable ellip-
tic equations will be based on the ability to solve such systems efficiently.
Our investigation of direct methods includes the block LU factoriza-
tion, the cyclic reduction algorithm (called odd-even reduction here) and
Sweet's generalized cyclic reduction algorithms [S10], along with varia-
tions on these basic themes. The methods are most easily described as a
sequence of matrix transformations applied to A in order to reduce it to
block diagonal form. We show that, under conditions of block diagonal
dominance, matrix norms describing the off-diagonal blocks relative to
the diagonal blocks will not increase, decrease quadratically, or decrease
at higher rates, for each basic algorithm, respectively. Thus the algo-
rithms are naturally classified by their convergence properties. For the
cyclic reduction algorithms, quadratic convergence leads to the defini-
tion of useful semi-direct methods to approximate the solution of Ax = v.
Some of our results in the area have previously appeared in [H2].
All of the algorithms discussed here can be run, with varying degrees
of efficiency, on a parallel computer that supports vector operations. By
this we mean an array processor, such as Illiac IV, or a pipeline processor,
such as the CDC STAR-100 or the Texas Instruments ASC. These machines
fall within the single instruction stream-multiple data stream classifica-
tion of parallel computers, and have sufficiently many common features to
make a general comparison of algorithms possible. However, there are
also many special restrictions or capabilities that must be considered
when designing programs for a particular machine. Our comparisons for
a simplified model of STAR indicate how similar comparisons should be
made for other machines. For a general survey of parallel methods in
linear algebra plus background material, [H3] is recommended.
In almost every case, a solution method tailored to a particular
problem on a particular computer can be expected to be superior to a
general method naively applied to that problem and machine. However,
the costs of tailoring can be quite high. Observations about the behav-
ior of the general method can be used to guide the tailoring for a later
solution, and we believe that the results presented here will allow us
to move in that direction.
B[A] in the analysis of direct and semidirect methods, though it has long
been used to analyze iterative methods.
Under the assumption ||B[A]|| < 1, when a direct method generates a
sequence of matrices with ||B[M_{i+1}]|| ≤ ||B[M_i]||, we classify the method
as linear, and if ||B[M_{i+1}]|| ≤ ||B[M_i]||^2 we say that it is
quadratic. The ∞-norm is used throughout; extension to other norms would
be desirable but proves quite difficult.
The block LU factorization and block Gauss-Jordan elimination are
shown to be linear methods, and their properties are investigated in Sec-
tion 3. Sample results are stability of the block factorization when
||B[A]|| < 1, bounds on the condition numbers of the pivotal blocks, and
convergence to zero of portions of the matrix in the Gauss-Jordan algorithm.
The methods of odd-even elimination and odd-even reduction, which are
inherently parallel algorithms closely related to the fast Poisson solvers,
are shown to be quadratic, and are discussed in Section 4. Odd-even elim-
ination is seen to be a variation of the quadratic Newton iteration for
A^{-1}, and odd-even reduction can be considered as a special case of Hageman
and Varga's general cyclic reduction technique [HI], as equivalent to
block elimination on a permuted system, and as a compact form of odd-even
elimination. We also discuss certain aspects of the fast Poisson solvers
as variations of the odd-even methods.
In Section 5 we show that the odd-even methods are the only quadratic
methods within a certain reasonable class of algorithms. This class in-
cludes all the previous methods as particular cases of a general matrix
iteration. The relationship between block fill-in and quadratic conver-
gence is also discussed.
In Section 6 the quadratic methods are shown to be special cases of
families of higher-order methods, characterized by ||B[M_{i+1}]|| ≤ ||B[M_i]||^p.
We note that the p-fold reduction (Sweet's generalized reduction [S10]) is
equivalent to a 2-fold reduction (odd-even) using a coarser partitioning
of A.
Semidirect methods based on the quadratic properties of the odd-even
algorithms are discussed in Section 7. Briefly, a semidirect method is a
direct method that is stopped before completion, producing an approximate
solution. We analyze the approximation error and its propagation vis-a-vis
rounding error in the back substitution of odd-even reduction. Practical
limitations of the semidirect methods are also considered.
In Section 8 we consider some iterative methods for systems arising
from differential equations, emphasizing their utility for parallel com-
puters. Fast Poisson solvers and general algorithms for systems with
small blocks form the basis for iterations using the natural, multi-line
and spiral orderings of a rectangular grid. We also remark on the use of
elimination and the applicability of the Parallel Gauss iteration [T2],
[H4].
A brief section on applications deals with two situations in which
the assumption ||B[A]|| < 1 is not met, namely curve fitting and finite ele-
ments, and in the latter case we describe a modification to the system
Ax = v that will obtain ||B[A]|| < 1.
Methods for solution of block tridiagonal systems are compared in Sec-
tion 10 using a simplified model of a pipeline (vector) computer. We con-
clude that a combination of odd-even reduction and LU factorization is a
powerful direct method, and that a combination of semidirect and iterative
methods may provide limited savings if an approximate solution is desired.
A general discussion of storage requirements and the effect of machine
constraints on vector operations is included; these are essential con-
siderations for practical use of the algorithms.
1.B. Notation
Let the block tridiagonal matrix

    A = (a_i, b_i, c_i)_N.

Here b_i is an n_i x n_i matrix and a_1 = 0, c_N = 0. The subscript N in the
triplex notation will often be omitted for simplicity. Define

    L[A] = (a_i, 0, 0),
    D[A] = (0, b_i, 0),
    U[A] = (0, 0, c_i),
    B[A] = I - (D[A])^{-1} A = (-b_i^{-1} a_i, 0, -b_i^{-1} c_i),
    C[A] = I - A (D[A])^{-1} = (-a_i b_{i-1}^{-1}, 0, -c_i b_{i+1}^{-1}).
Use of these notations for general partitioned matrices should be evident;
nonsingularity of D[A] is assumed in the definitions of B and C. B[A] is
the matrix associated with the block Jacobi linear iteration for the solu-
tion of Ax = v. It also assumes an important role in the study of certain
semidirect methods and will be central to our analysis of direct methods.
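To make the definitions above concrete, the following is a minimal Python sketch (an illustration added here, not part of the thesis), taking the simplest case of 1 x 1 blocks so that D[A]^{-1} reduces to division by the diagonal entries b_j; matrices are stored as nested lists.

```python
# Illustrative sketch: form B[A] = I - D[A]^{-1} A for a block tridiagonal
# A = (a_j, b_j, c_j)_N with 1 x 1 blocks, where D[A]^{-1} is elementwise
# division by the diagonal entries b_j.

def tridiag(a, b, c):
    """Assemble the N x N matrix (a_j, b_j, c_j)_N as a list of rows."""
    n = len(b)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] = b[j]
        if j > 0:
            A[j][j - 1] = a[j]
        if j < n - 1:
            A[j][j + 1] = c[j]
    return A

def jacobi_matrix(A):
    """B[A] = I - D[A]^{-1} A; for 1 x 1 blocks D[A] = diag(A)."""
    n = len(A)
    return [[(1.0 if i == j else 0.0) - A[i][j] / A[i][i] for j in range(n)]
            for i in range(n)]

def inf_norm(M):
    """The infinity-norm: maximum absolute row sum, max_i p_i(M)."""
    return max(sum(abs(x) for x in row) for row in M)

# A strictly diagonally dominant example: ||B[A]|| < 1, as expected.
a = [0.0, -1.0, -1.0]          # a_1 = 0 by convention
b = [4.0, 4.0, 4.0]
c = [-1.0, -1.0, 0.0]          # c_N = 0 by convention
B = jacobi_matrix(tridiag(a, b, c))
print(inf_norm(B))             # 0.5: the interior row sums to 2/4
```

For this example B[A] is exactly the point Jacobi iteration matrix, which is why the computed norm matches the classical bound for tridiag(-1, 4, -1).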
We further define E_j = diag(0,...,0, I, 0,...,0), where the I is
n_j x n_j and occurs in the jth block diagonal position. Pre- (post-) mul-
tiplication by E_j selects the jth block row (column) of a matrix, so that,
for example, D[A] = Σ_{i=1}^{N} E_i A E_i, D[A] E_j = E_j D[A], and
E_j L[A] E_j = 0.
For an n x n matrix M and an n-vector v, let

    p_i(M)  = Σ_{j=1}^{n} |m_ij|,
    ν_j(M)  = Σ_{i=1}^{n} |m_ij|,
    ||M||   = max_i p_i(M)   (the ∞-norm),
    ||M||_1 = max_j ν_j(M),
    ||v||   = max_i |v_i|,
    ||v||_1 = Σ_{i=1}^{n} |v_i|,
    |M|     = (|m_ij|), and
    ρ(M)    = the spectral radius of M.

The reader should take care not to confuse row sums (p_i(M)) with the spec-
tral radius (ρ(M)). When A = (a_j, b_j, c_j), let λ(A) = max_j (4 ||b_j^{-1} a_j|| ||b_j^{-1} c_j||).
The condition λ(A) ≤ 1 will be used as an alternative to ||B[A]|| < 1.
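These scalar quantities translate directly into code; the short Python sketch below (an added illustration, with matrices as nested lists) also exercises the identity ||M||_1 = ||M^T|| used repeatedly in Section 2.

```python
# Direct transcription of the norm definitions: row sums p_i, column
# sums nu_j, and the induced max-row-sum and max-column-sum norms.

def p(M, i):
    """p_i(M): absolute sum of row i."""
    return sum(abs(x) for x in M[i])

def nu(M, j):
    """nu_j(M): absolute sum of column j."""
    return sum(abs(row[j]) for row in M)

def norm_inf(M):
    """||M|| = max_i p_i(M), the infinity-norm used throughout."""
    return max(p(M, i) for i in range(len(M)))

def norm_1(M):
    """||M||_1 = max_j nu_j(M)."""
    return max(nu(M, j) for j in range(len(M[0])))

M = [[1.0, -2.0],
     [3.0,  0.5]]
MT = [[M[i][j] for i in range(2)] for j in range(2)]   # transpose
print(norm_inf(M))               # max(3.0, 3.5) = 3.5
print(norm_1(M))                 # max(4.0, 2.5) = 4.0
print(norm_1(M) == norm_inf(MT)) # ||M||_1 = ||M^T||, as used in Lemma 2.3
```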
2. SOME INITIAL REMARKS
2.A. Analytic Tools
Much of the analysis presented in later sections involves the matrix
B[A]. Why should we consider this matrix at all?
First, it is a reasonably invariant quantity which measures the diag-
onal dominance of A. That is, if E is any nonsingular block diagonal matrix
conforming with the partitioning of A, let A* = EA, so B* = B[A*] = I - D*^{-1}A*
= I - (ED)^{-1}(EA) = I - D^{-1}A = B. Thus B[A] is invariant under block scal-
ings of the equations Ax = v. On the other hand, B is not invariant under
a change of variables y = Ex, which induces A' = AE^{-1}. We have
B' = B[A'] = EBE^{-1}, and for general A and E only the spectral radius ρ(B)
is invariant. Similarly, C[A] measures the diagonal dominance of A by
columns, and is invariant under block diagonal change of variables but not
under block scaling of the equations.
Secondly, one simple class of parallel semidirect methods (Section 7)
can be created by transforming Ax = v into A*x = v*, A* = MA, v* = Mv for
some matrix M, where ||B[A*]|| << ||B[A]||. This transformation typically
represents the initial stages of some direct method. An approximate solu-
tion y = D[A*]^{-1} v* is then computed with error x - y = B[A*]x, so ||B[A*]||
measures the relative error. Thus we study the behavior of B[A] under
various matrix transformations.
Thirdly, B[A] is the matrix which determines the rate of convergence
of the block Jacobi iteration for Ax = v. This is a natural parallel itera-
tion, as are the faster semi-iterative methods based on it when A is posi-
tive definite. It is known that various matrix transformations can affect
the convergence rates of an iteration, so we study this property.
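The block Jacobi iteration just mentioned can be sketched in a few lines of Python (an added illustration, again for the 1 x 1 block case, so the iteration matrix is exactly the B[A] computed pointwise): x_{k+1} = x_k + D[A]^{-1}(v - A x_k).

```python
# Sketch of the Jacobi iteration whose iteration matrix is B[A]:
# x_{k+1} = B[A] x_k + D[A]^{-1} v, written for 1 x 1 blocks so that
# D[A]^{-1} is elementwise division by the diagonal.

def jacobi_solve(A, v, sweeps=100):
    n = len(A)
    x = [0.0] * n
    for _ in range(sweeps):
        # one sweep: x <- x + D^{-1} (v - A x)
        r = [v[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + r[i] / A[i][i] for i in range(n)]
    return x

A = [[ 4.0, -1.0,  0.0],
     [-1.0,  4.0, -1.0],
     [ 0.0, -1.0,  4.0]]
v = [3.0, 2.0, 3.0]
x = jacobi_solve(A, v)
print(x)   # converges to the solution [1, 1, 1]
```

Since ||B[A]|| = 1/2 here, the error is cut at least in half every sweep, which is the sense in which B[A] "determines the rate of convergence."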
Finally, B[A] and C[A] arise naturally in the analysis of block elim-
ination methods to solve Ax = v. C represents multipliers used in elimina-
tion, and B represents similar quantities used in back substitution. In
order to place upper bounds on the size of these quantities we consider
the effects of elimination on || B[A]||.
Since we will make frequent use of the condition ||B[A]|| < 1 we now
discuss some conditions on A which will guarantee that it holds.
Lemma 2.1. Let J = J_1 + J_2 be order n, |J| = |J_1| + |J_2|, and suppose K
satisfies K = J_1 K + J_2. Then ||J|| < 1 implies p_i(K) ≤ p_i(J) for 1 ≤ i ≤ n,
and so ||K|| ≤ ||J||.
Proof. From |J_1| + |J_2| = |J| we have p_i(J_1) + p_i(J_2) = p_i(J) for 1 ≤ i ≤ n.
From K = J_1 K + J_2 we have p_i(K) ≤ p_i(J_1) ||K|| + p_i(J_2). Suppose ||K|| ≥ 1.
Then p_i(K) ≤ p_i(J_1) ||K|| + p_i(J_2) ||K|| = p_i(J) ||K|| ≤ ||J|| ||K||, which implies
||K|| ≤ ||J|| ||K|| and 1 ≤ ||J||, a contradiction. Thus ||K|| < 1 and
p_i(K) ≤ p_i(J_1) + p_i(J_2) = p_i(J). QED.
We note that if p_i(J_1) ≠ 0 then we actually have p_i(K) < p_i(J), and
if p_i(J_1) = 0 then p_i(K) = p_i(J_2) = p_i(J). It follows that if each row of
J_1 contains a nonzero element then ||K|| < ||J||. These conditions are rem-
iniscent of the Stein-Rosenberg Theorem ([V3], [H8]) which states that if
J is irreducible and nonnegative, J_2 ≠ 0 and ρ(J_1) < 1, then ρ(J) < 1
implies ρ(K) < ρ(J) and ρ(J) = 1 implies ρ(K) = ρ(J). There are also cor-
respondences to the theory of regular splittings [V3].
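A quick numerical check of Lemma 2.1 (an added illustration): take J_1 to be the diagonal part of J and J_2 the off-diagonal part, so |J| = |J_1| + |J_2| holds automatically and K = (I - J_1)^{-1} J_2 is just a row scaling.

```python
# Numerical check of Lemma 2.1 with J_1 = diag part of J, J_2 = off-diag
# part.  K solves K = J_1 K + J_2, i.e. K = (I - J_1)^{-1} J_2, which for
# diagonal J_1 divides row i of J_2 by 1 - J[i][i].

def inf_norm(M):
    return max(sum(abs(x) for x in row) for row in M)

def row_sum(M, i):
    return sum(abs(x) for x in M[i])

J = [[0.3, -0.4, 0.1],
     [0.2, -0.2, 0.3],
     [0.1,  0.4, 0.2]]
n = len(J)
assert inf_norm(J) < 1          # hypothesis ||J|| < 1

K = [[(0.0 if i == j else J[i][j]) / (1.0 - J[i][i]) for j in range(n)]
     for i in range(n)]
print(inf_norm(K), inf_norm(J))  # Lemma 2.1: ||K|| <= ||J||
```

The row-by-row conclusion p_i(K) ≤ p_i(J) can be checked the same way, and since every row of J_1 is nonzero here the inequality is in fact strict, as remarked above.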
Lemma 2.2. Suppose J = (I - S)K + T is order n, |J| = |(I - S)K| + |T|, and
p_i(S) ≤ p_i(T), 1 ≤ i ≤ n. Then ||J|| < 1 implies p_i(K) ≤ p_i(J), 1 ≤ i ≤ n,
and so ||K|| ≤ ||J||.
Proof. For 1 ≤ i ≤ n, p_i(T) ≤ p_i(T) + p_i((I - S)K) = p_i(J) ≤ ||J|| < 1,
so p_i(T) < 1. Now, 1 > p_i(J) = p_i((I - S)K) + p_i(T) ≥ p_i(K) - p_i(S) ||K||
+ p_i(T) ≥ p_i(K) + p_i(T)(1 - ||K||). Suppose ||K|| = p_j(K) > 1. Then
1 > p_j(K) + p_j(T)(1 - p_j(K)), and we have 1 < p_j(T), a contradiction.
Thus ||K|| ≤ 1 and p_i(J) ≥ p_i(K). QED.
If S = T in Lemma 2.2 then K = SK + (J - S), |J| = |(I - S)K| + |S|
= |J - S| + |S|, so by Lemma 2.1 we again conclude ||K|| ≤ ||J||.
Similar results hold for the 1-norm.
Lemma 2.3. Let J = J_1 + J_2, |J| = |J_1| + |J_2|, and suppose K satisfies
K = K J_1 + J_2. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.
Proof. Since J^T = J_1^T + J_2^T, |J^T| = |J_1^T| + |J_2^T|, K^T = J_1^T K^T + J_2^T, and
||M||_1 = ||M^T||, we have ||J^T|| = ||J||_1 < 1, and by Lemma 2.1, ||K||_1 = ||K^T||
≤ ||J^T|| = ||J||_1. QED.
Lemma 2.4. Suppose J = K(I - S) + T, |J| = |K(I - S)| + |T|, and ν_i(S) ≤ ν_i(T),
1 ≤ i ≤ n. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.
Proof. We have ν_i(M) = p_i(M^T), J^T = (I - S^T)K^T + T^T, so by Lemma 2.2
||K^T|| ≤ ||J^T||, and the result follows. QED.
To investigate the condition ||B[A]|| < 1, we first note that if A is
strictly diagonally dominant by rows then ||B[A]|| < 1 under any partition-
ing of A. Indeed, if we take J to be the point Jacobi linear iteration
matrix for A, J = I - diag(A)^{-1} A, J_1 = D[J], J_2 = J - J_1, then
||J_1|| ≤ ||J|| < 1, |J| = |J_1| + |J_2| and B[A] = (I - J_1)^{-1} J_2, so
||B[A]|| ≤ ||J|| < 1 by use of Lemma 2.1.
Now, if we denote the partitioning of A by the vector Π = (n_1, n_2, ..., n_N)
then the above remark may be generalized. Suppose Π' = (m_1, m_2, ..., m_M) rep-
resents another partitioning of A which refines Π. That is, there are inte-
gers 1 = p_0 < p_1 < ... < p_N = M+1 such that n_i = Σ_{j=p_{i-1}}^{p_i - 1} m_j. If B[A] and
B'[A] are defined by the two partitions and ||B'[A]|| < 1 then ||B[A]||
≤ ||B'[A]||. This is an instance of the more general rule for iterative
methods that "more implicitness means faster convergence" (cf. [H8, p. 107])
and will be useful in Section 6. Briefly, it shows that one simple way to
decrease ||B[A]|| is to use a coarser partition.
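The effect of coarsening the partition can be seen numerically (an added illustration): for the order-4 tridiagonal matrix A = (-1, 4, -1), the point partition gives ||B'[A]|| = 1/2, while the block partition Π = (2, 2) gives a strictly smaller norm.

```python
# Compare ||B[A]|| under the point partition and under a coarser 2 x 2
# block partition of the same matrix A = (-1, 4, -1) of order 4.

def inf_norm(M):
    return max(sum(abs(x) for x in row) for row in M)

def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Point partition: B'[A] has off-diagonal entries 1/4, so the interior
# rows sum to 1/2.
norm_point = 0.5

# Block partition Pi = (2, 2): the only nonzero blocks of B[A] are
# -b_1^{-1} c_1 and -b_2^{-1} a_2, with b_j = [[4,-1],[-1,4]].
b  = [[4.0, -1.0], [-1.0, 4.0]]
c1 = [[0.0,  0.0], [-1.0, 0.0]]
a2 = [[0.0, -1.0], [ 0.0, 0.0]]
top = mul2(inv2(b), c1)     # block (1,2) of -B[A]
bot = mul2(inv2(b), a2)     # block (2,1) of -B[A]
norm_block = max(inf_norm(top), inf_norm(bot))
print(norm_block, norm_point)   # 4/15 = 0.266... < 0.5
```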
Let us consider a simple example to illustrate these ideas. Using
finite differences, the differential equation -u_xx - u_yy + λu = f on [0,1]^2,
λ > 0, with Dirichlet boundary conditions, gives rise to the matrix
A = (-I, M, -I), M = (-1, 4+λh^2, -1). By Lemma 2.1, ||B[A]|| ≤ ||J[A]|| =
4/(4+λh^2) < 1. Actually, since B[A] is easily determined we can give the
better estimate ||B[A]|| = 2||M^{-1}|| ≤ 2/(2+λh^2).
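Both estimates are easy to confirm numerically (an added illustration, taking λh^2 = 1 as an assumed sample value). Since M here is an M-matrix, M^{-1} has nonnegative entries, so the row sums of |M^{-1}| are the entries of the solution of M y = (1,...,1)^T, giving ||M^{-1}|| = max_i y_i without forming the inverse.

```python
# Check ||B[A]|| = 2 ||M^{-1}|| <= 2/(2 + lam*h^2) for the model problem,
# with M = (-1, 4 + lam*h^2, -1) and the assumed value lam*h^2 = 1.

def thomas(a, b, c, d):
    """Solve the tridiagonal system (a, b, c) y = d by elimination."""
    n = len(b)
    bp, cp, dp = b[:], c[:], d[:]
    for i in range(1, n):
        w = a[i] / bp[i - 1]
        bp[i] -= w * cp[i - 1]
        dp[i] -= w * dp[i - 1]
    y = [0.0] * n
    y[-1] = dp[-1] / bp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = (dp[i] - cp[i] * y[i + 1]) / bp[i]
    return y

m = 20                      # order of M (interior points in one direction)
lam_h2 = 1.0                # assumed value of lam * h^2
diag = 4.0 + lam_h2
y = thomas([-1.0] * m, [diag] * m, [-1.0] * m, [1.0] * m)
norm_B = 2.0 * max(y)       # ||B[A]|| = 2 ||M^{-1}||, rows sums via M y = 1
print(norm_B)               # bounded by 2/(2 + lam*h^2) = 2/3
```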
Indeed, many finite difference approximations to elliptic boundary
value problems will produce matrices A such that ||B[A]|| < 1 [V3]; ||B[A]||
will frequently be quite close to 1. This condition need not hold for
certain finite element approximations, so a different analysis will be
needed in those cases. Varah [Vl], [V2] has examples of this type of
analysis which are closely related to the original continuous problem. In
Section 9.B we consider another approach based on a change of variable.
||B[A]|| < 1 is a weaker condition than strict block diagonal dominance
relative to the ∞-norm, which for block tridiagonal matrices is the condi-
tion ([F1], [V1]) ||b_j^{-1}|| (||a_j|| + ||c_j||) < 1, 1 ≤ j ≤ N. The general
definition of block diagonal dominance allows the use of different norms.
so for completeness we should also consider theorems based on this
assumption. However, in order to keep the discussion reasonably concise
this will not always be done. It is sometimes stated that block diag-
onal dominance is the right assumption to use when analyzing block elimi-
nation, but we have found it more convenient and often more general to
deal with the assumption || B[A]||< 1.
Other concepts related to diagonal dominance are G- and H-matrices.
If J = I - diag(A)^{-1} A = L + U, where L (U) is strictly lower (upper) trian-
gular, then A is strictly diagonally dominant if ||J|| < 1, a G-matrix if
||(I - |L|)^{-1} |U| || < 1, and an H-matrix if ρ(|J|) < 1. Ostrowski [O2]
shows that A is an H-matrix iff there exists a nonsingular diagonal matrix
E such that AE is strictly diagonally dominant. Varga [V4] summarizes
recent results on H-matrices. Robert [R3] generalizes these definitions
to block diagonally dominant, block G- and block H-matrices using vector-
ial and matricial norms, and proves convergence of the classical iterative
methods for Ax = v. Again, for simplicity we will not be completely gen-
eral and will only consider some representative results.
2.B. Models of Parallel Computation
In order to read Sections 3-9 only a very loose conception of a vector-
oriented parallel computer is necessary. A more detailed description is
needed for the execution time comparison of methods in Section 10. This
subsection contains sufficient information for the initial part of the
text, while additional details may be found in Appendix A, [H3], [S3], and
the references given below.
Parallel computers fall into several general classes, of which we con-
sider the vector computers, namely those operating in a synchronous manner
and capable of performing efficient floating point vector operations.
Examples are the array processor Illiac IV, with 64 parallel processing
elements [B6], [B12]; the Cray-1, with eight 64-word vector registers [C8];
and the pipeline processors CDC STAR-100 [C2] and Texas Instruments ASC
[Wl], which can handle vectors of (essentially) arbitrary length. These
machines have become important tools for large scientific computations,
and there is much interest in algorithms able to make good use of special
machine capabilities. Details of operation vary widely, but there are suf-
ficiently many common features to make a general comparison of algorithms
possible.
An important parallel computer not being considered here is the asyn-
chronous Carnegie-Mellon University C.mmp [W4], with up to 16 minicomputer
processors. Baudet [B7] discusses the theory of asynchronous iterations
for linear and nonlinear systems of equations, plus results of tests on
C.mmp.
Instructions for our model of a vector computer consist of the usual
scalar operations and a special set of vector operations, some of which
will be described shortly. We shall consider a conceptual vector to be
simply an ordered set of storage locations, with no particular regard to
the details of storage. A machine vector is a more specialized set of
locations valid for use in a vector instruction, usually consisting of
locations in arithmetic progression. The CDC STAR restricts the progres-
sion to have increment = 1, which often forces programs to perform extra
data manipulation in order to organize conceptual vectors into machine
vectors.
We denote vector operations by a parenthesized list of indices. Thus
the vector sum w = x + y is

    w_i = x_i + y_i,    (1 ≤ i ≤ N)

if the vectors are in R^N, the componentwise product would be

    w_i = x_i × y_i,    (1 ≤ i ≤ N),

and a matrix multiplication C = AB would be

    c_ij = Σ_{k=1}^{n} a_ik b_kj,    (1 ≤ i ≤ n; 1 ≤ j ≤ n).
Execution time for a vector operation depends on the length of the vector
and the internal mechanism of the vector computer, but can usually be
written in the form τ_op N + σ_op if the vector length is N. On a pipeline
computer, 1/τ_op is the asymptotic result rate and σ_op is the vector instruc-
tion startup cost. On a k-parallel computer (one with either k parallel
processors or k-word vector registers) the basic time would be t_op for k
operations in parallel and the vector instruction cost t_op ⌈N/k⌉ + overhead.
In either case the τN + σ model is adequate for vector operations; scalar
operation times are given as t_op. Simplified instruction execution times
for the CDC STAR-100 are given below, in terms of the 40 nanosecond cycle
time. Usually τ_op << t_op << σ_op.

    operation        scalar time    vector time
    add, subtract        13           N + 71
    multiply             17           N + 159
    divide               46          2N + 167
    square root          74          2N + 155
    transmit             11           N + 91
    maximum               -          6N + 95
    summation             -          6.5N + 122
    inner product         -          6N + 130
The use of long vectors is essential for effective use of a vector
computer, in order to take full advantage of increased result rates in
the vector operations. The rather large vector startup costs create a
severe penalty for operations on short vectors, and one must be careful
to design algorithms which keep startup costs to a minimum. Scalar opera-
tions can also be a source of inefficiency even when most of the code con-
sists of vector operations. Data manipulation is often an essential con-
sideration, especially when there are strong restrictions on machine vec-
tors. These issues and others are illustrated in our comparison of algo-
rithms in Section 10, which is based on information for the CDC STAR-100.
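The startup penalty can be quantified directly from the τN + σ model (an added illustration, using the simplified STAR-100 cycle counts quoted above): the break-even length is the smallest N at which a vector operation beats N scalar operations.

```python
# Break-even vector length under the tau*N + sigma timing model: the
# smallest N with tau*N + sigma < scalar*N.  Timing values are the
# simplified CDC STAR-100 cycle counts from the table above.

def break_even(tau, sigma, scalar):
    """Smallest N at which the vector form is strictly faster."""
    n = 1
    while tau * n + sigma >= scalar * n:
        n += 1
    return n

print(break_even(1.0, 159.0, 17.0))   # multiply: N = 10 (10 + 159 < 170)
print(break_even(2.0, 167.0, 46.0))   # divide:   N = 4  (8 + 167 < 184)
```

Even for these short break-even lengths, the asymptotic speedup (17x for multiply) is only approached for N in the hundreds, which is the sense in which long vectors are essential.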
3. LINEAR METHODS
The simplest approach to direct solution of a block tridiagonal linear
system Ax = v is block elimination, which effectively computes an LU factor-
ization of A. We discuss the process in terms of block elimination on a
general partitioned matrix under the assumption j| B[A]|j< 1; the process is
shown to be linear and stable. Various properties of block elimination are
considered, such as pivoting requirements, bounds on the condition numbers
of the pivotal blocks, error propagation in the back substitution, and
special techniques for special systems. Gauss-Jordan elimination is studied
for its numerical properties, which foreshadow our analysis of the more impor-
tant odd-even elimination and reduction algorithms.
3 .A. The LU Factorization
3.A.I. Block Elimination
The block tridiagonal LU factorization

    A = (a_j, b_j, c_j)_N = (ℓ_j, I, 0)_N (0, d_j, c_j)_N

is computed by

    d_1 = b_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},    j = 2, ..., N,
    ℓ_j = a_j d_{j-1}^{-1},                  j = 2, ..., N,

where the existence of d_{j-1}^{-1} is assumed and will be demonstrated when ||B[A]|| < 1.
For the moment we are not concerned with the particular methods used to compute
or avoid computing either ℓ_j or d_{j-1}^{-1} c_{j-1}.
Let A_1 = A and consider the matrix A_2 which is obtained after the first
block elimination step. We have A_2 = (α_j, β_j, c_j)_N, β_1 = d_1 = b_1, α_2 = 0,
β_2 = d_2, α_j = a_j and β_j = b_j for 3 ≤ j ≤ N. Let B_1 = B[A_1], B_2 = B[A_2].
Note that B_1 and B_2 differ only in the second block row; more specifically,
in the (2,1) and (2,3) positions.
Theorem 3.1. If ||B_1|| < 1 then ||B_2|| ≤ ||B_1||.
Proof. Let T = B_1 E_1 and let S = T B_1. We have d_2 = b_2(I - b_2^{-1} a_2 b_1^{-1} c_1),
||b_2^{-1} a_2 b_1^{-1} c_1|| = ||S|| ≤ ||B_1|| ||E_1|| ||B_1|| = ||B_1||^2 < 1, so d_2^{-1} exists and B_2
is well-defined. Also, B_1 = (I - S)B_2 + T, |B_1| = |(I - S)B_2| + |T| by con-
struction, and p_i(S) ≤ p_i(T) ||B_1|| ≤ p_i(T). Thus Lemma 2.2 applies, yielding
||B_2|| ≤ ||B_1||. QED.
Now, for any partitioned matrix the N-1 stages of block elimination (no
block pivoting) can be described by the process

    for i = 1, ..., N-1
        A_{i+1} = (I - L[A_i] E_i D[A_i]^{-1}) A_i
                = (I + L[C[A_i]] E_i) A_i.

We transform A_i into A_{i+1} by eliminating all the blocks of column i below the
diagonal; when A is block tridiagonal A_N is the upper block triangular matrix
(0, d_j, c_j)_N. The same effect can be had by eliminating the blocks individually
by the process

    A_1 = A
    for i = 1, ..., N-1
        A_{i+1}^{(i)} = A_i
        for j = i+1, ..., N
            A_{i+1}^{(j)} = (I - E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1}) A_{i+1}^{(j-1)}
                          = (I + E_j C[A_{i+1}^{(j-1)}] E_i) A_{i+1}^{(j-1)}
        A_{i+1} = A_{i+1}^{(N)}.
Each minor stage affects only block row j of the matrix. It is seen that

    block row j of A_{i+1}^{(j-1)} = block row j of A_i,
    block row i of A_{i+1}^{(j-1)} = block row i of A_i,

from which it follows that

    E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1} = E_j A_i E_i D[A_i]^{-1},

so the major stages indeed generate the same sequence of matrices A_i. Let
B_i = B[A_i].
Corollary 3.1. If A is block tridiagonal and ||B_1|| < 1 then ||B_{i+1}|| ≤ ||B_i||,
1 ≤ i ≤ N-1.
The proof of Corollary 3.1 requires a bit more care than the proof of Theorem
3.1 would seem to indicate; this difficulty will be resolved in Theorem 3.2.
We should also note the trivial "error relations"

    ||A_{i+1} - A_N|| ≤ ||A_i - A_N||,
    ||B_{i+1} - B_N|| ≤ ||B_i - B_N||.

These follow from the fact that, for block tridiagonal matrices,

    A_i = Σ_{j=1}^{i} E_j A_N + Σ_{j=i+1}^{N} E_j A_1,

which holds regardless of the value of ||B_1||.
Wilkinson [W3] has given a result for Gaussian elimination that is similar
to Corollary 3.1, namely that if A is strictly diagonally dominant by columns
then the same is true for all reduced matrices and ||A_{i+1} (reduced)||_1 <=
||A_i (reduced)||_1. This actually applies to arbitrary matrices and not just
the block tridiagonal case. Brenner [B14] anticipated the result but in
a different context.
Our next two theorems, generalizing Corollary 3.1 and Wilkinson's observa-
tion, lead to the characterization of block elimination as a norm-non-increasing
projection method*, and prove by bounding the growth factors that block
elimination is stable when ||B[A]|| < 1. In addition, many special schemes for
block tridiagonal matrices are equivalent to block elimination on a permuted
matrix, so the result will be useful in some immediate applications.
Theorem 3.2. If ||B_1|| < 1 then ||B_{i+1}|| <= ||B_i|| and ||A_{i+1}|| <= ||A_i|| for
1 <= i <= N-1.

Proof. Suppose ||B_i|| < 1. We have

A_{i+1} = (I - L[A_i]E_iD[A_i]^{-1})A_i
        = A_i - L[A_i]E_i + L[A_i]E_iB_i,

ρ_j(A_{i+1}) <= ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_iB_i)
            <= ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_i) ||B_i||
            <= ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_i)
            = ρ_j(A_i),

so ||A_{i+1}|| <= ||A_i||. Now let

Q = D[A_i]^{-1}L[A_i]E_iD[A_i]^{-1}A_i,

so A_{i+1} = A_i - D[A_i]Q, D[A_{i+1}] = D[A_i](I - D[Q]). Let S = D[Q]. Since

Q = (-L[B_i]E_i)(I - B_i) = -L[B_i]E_i + L[B_i]E_iB_i,

S = D[L[B_i]E_iB_i] and ||S|| <= ||L[B_i]|| ||E_i|| ||B_i|| <= ||B_i||^2 < 1. Thus
D[A_{i+1}]^{-1} exists and B_{i+1} is well-defined. It may be verified that
(I - S)B_{i+1} = B_i - S + Q. Let T = B_i - S + Q, J = S + T = B_i + Q. Since
D[B_i] = 0 and D[Q - S] = 0, we have D[T] = 0 and |J| = |S| + |T|. Thus, by Lemma 2.1,
||J|| < 1 implies ||B_{i+1}|| <= ||J||. But we have, after some algebraic manipulation,

J = B_i + Q = Σ_{k=1}^{i} E_kB_i + Σ_{k=i+1}^{N} E_kB_i(I - E_i + E_iB_i),

and it follows from this expression that ||J|| <= ||B_i||. QED

* The projections are onto spaces of matrices with zeros in appropriate positions.
Although expressed in terms of the N-1 major stages of block elimination,
it is clear that we also have, for B_{i+1}^{(j)} = B[A_{i+1}^{(j)}], ||B_{i+1}^{(j)}|| <= ||B_{i+1}^{(j-1)}|| and
||A_{i+1}^{(j)}|| <= ||A_{i+1}^{(j-1)}||. Moreover, as long as blocks below the diagonal are
eliminated such inequalities will hold regardless of the order of elimination.
We shall adopt the name "linear method" for any process which generates
a sequence of matrices M_i such that ||B[M_{i+1}]|| <= ||B[M_i]||, given that
||B[M_1]|| < 1. Thus block elimination is a linear method.
A result that is dual to Theorem 3.2 is the following.
Definition. Let Â_i be the reduced matrix consisting of the last N-i+1 block
rows and columns of A_i (Â_1 = A_1 = A).

Theorem 3.3. If ||C[A]||_1 < 1 then ||C[Â_{i+1}]||_1 <= ||C[Â_i]||_1, ||Â_{i+1}||_1 <= ||Â_i||_1,
and ||L[C_i]E_i||_1 <= ||C[A]||_1, C_i = C[A_i].
Remark. It is not true in general that ||C_{i+1}||_1 <= ||C_i||_1 even if ||C_1||_1 < 1.
Proof. Let the blocks of C_i be c_mp, 1 <= m, p <= N. Then C[Â_i] = (c_mp)_{i<=m,p<=N}.
Let

K = (c_mp + c_mi c_ip)_{i+1<=m,p<=N},

so that C[Â_{i+1}] = K, provided ||K||_1 < 1. Using Lemma 2.4, for each column q,

ν_q(K) <= ν_q((Σ_{k=i+1}^{N} E_k)C[Â_i]) + ||C[Â_i]||_1 ν_q(E_iC[Â_i])
       <= ν_q((Σ_{k=i+1}^{N} E_k)C[Â_i]) + ν_q(E_iC[Â_i])
       = ν_q(C[Â_i]).

Thus ||K||_1 <= ||C[Â_i]||_1 < 1 and ||C[Â_{i+1}]||_1 <= ||C[Â_i]||_1. By construction
this implies ||L[C_i]E_i||_1 <= ||C[A]||_1 for 1 <= i <= N. Now, let the blocks of
Â_i be a_mp, 1 <= m, p <= N. Then Â_{i+1} = (a_mp + c_mi a_ip)_{i+1<=m,p<=N} and

ν_q(Â_{i+1}) <= ν_q((Σ_{k=i+1}^{N} E_k)Â_i) + ||C[Â_i]||_1 ν_q(E_iÂ_i)
            <= ν_q(Â_i). QED
We note that if A is symmetric then C[A] = B[A]^T and Theorems 3.2 and 3.3
become nearly identical. Of course, B and C are always similar since C = DBD^{-1}.
When A is positive definite, Cline [C3] has shown that the eigenvalues of Â_{i+1}
interlace those of Â_i. Actually, the proof depends on use of the Cholesky
factorization, but the reduced matrices are the same in both methods. In
consequence of Cline's result we have ||Â_{i+1}||_2 <= ||Â_i||_2.
The statements of Theorems 3.2 and 3.3 depend heavily on the particular
norms that were used. It is not true in general, for example, that if
||B[A_1]||_1 < 1 then ||B[A_2]||_1 <= ||B[A_1]||_1 or even ||B[Â_2]||_1 <= ||B[Â_1]||_1. This
makes the extension to other norms quite difficult to determine.
3.A.2. Further Properties of Block Elimination
It is well-known that the block factorization A = LΔU is given by

L = I - Σ_{j=1}^{N} L[C_j]E_j,
Δ = D[A_N],
U = I - B_N = I - Σ_{j=1}^{N} E_jU[B_j].

We have therefore shown in Theorems 3.2 and 3.3 that if ||C_1||_1 < 1 then
||L||_1 <= 1 + ||C_1||_1 < 2, ||L|| <= 1 + Nn||C_1||_1, n = max_j n_j, and if ||B_1|| < 1
then ||U|| <= 1 + ||B_1|| < 2, ||U||_1 <= 1 + Nn||B_1||. (For block tridiagonal
matrices the factor N can be removed, so ||L|| <= 1 + n||C_1||_1, ||U||_1 <= 1 + n||B_1||.)
Moreover, the factorization can be completed without permuting the block rows
of A. The purpose of block partial pivoting for arbitrary matrices is to
force the inequality ||L[C_i]E_i|| <= 1. Broyden [B15] contains a number of
interesting results related to pivoting in Gaussian elimination. In particular
it is shown that partial pivoting has the effect of bounding the condition
number of L, though the bound and condition number can actually be quite
large in the general case. We will have some more comments on pivoting after
looking at some matrix properties that are invariant under block elimination
(cf. [B11]).
Corollary 3.2. Block elimination preserves strict diagonal dominance.
Proof. It follows from Theorem 3.2 that Gaussian elimination preserves strict
diagonal dominance: simply partition A into 1×1 blocks. Now suppose we have
A_i and are about to generate A_{i+1}. This can be done either by one stage of
block elimination or by n_i stages of Gaussian elimination, for both processes
produce the same result when using exact arithmetic [H8, p. 130]. Thus if A_i
is strictly diagonally dominant, then so is A_{i+1}. QED
It is shown in [B4] that Gaussian elimination maps G-matrices into
G-matrices. This can be extended in an obvious fashion to the case of block
elimination. We also have
Corollary 3.3. Block elimination maps H-matrices into H-matrices.
Proof. Recall that A is an H-matrix if and only if there exists a nonsingular
diagonal matrix E such that A' = AE is strictly diagonally dominant. Con-
sider the sequence
A'_1 = A',
A'_{i+1} = (I - L[A'_i]E_iD[A'_i]^{-1})A'_i.

Clearly A'_1 = A_1E; suppose that A'_i = A_iE is strictly diagonally dominant. We
then have

A'_{i+1} = (I - L[A_i]E E_iE^{-1}D[A_i]^{-1})A_iE
         = (I - L[A_i]E_iD[A_i]^{-1})A_iE
         = A_{i+1}E,

so A_{i+1} is an H-matrix since A'_{i+1} is strictly diagonally dominant by Corollary
3.2. QED
We have chosen to give this more concrete proof of Corollary 3.3 because
it allows us to estimate another norm of B[A] when A is an H-matrix. Define
||M||_E = ||E^{-1}ME||. Then since A' = AE implies B[A'] = E^{-1}B[A]E, it follows
that if A is an H-matrix then ||B[A_{i+1}]||_E <= ||B[A_i]||_E. Since the matrix E
will generally remain unknown, this is in the nature of an existential proof.
The proof of Corollary 3.2 brings out an important interpretation of
Theorem 3.2. If block elimination is actually performed then our result
gives direct information about properties of the reduced matrices. If not,
and ordinary Gaussian elimination is used instead, then Theorem 3.2 pro-
vides information about the process after each n_i steps. While we are not
assured that pivoting will not be necessary, we do know that pivot selec-
tions for these n_i steps will be limited to elements within the d_i block
for the block tridiagonal LU factorization. When A has been partitioned to
deal with mass storage hierarchies this will be important knowledge (see
Section 10.A).
One more consequence of Theorem 3.2 is
Corollary 3.4. In the block tridiagonal LU factorization, if ||B_1|| < 1 then
cond(d_j) = ||d_j|| ||d_j^{-1}|| < 2 cond(b_j) / (1 - ||B_1||^2).

Proof. Letting α_j = b_j^{-1}a_jd_{j-1}^{-1}c_{j-1}, we have d_j = b_j(I - α_j), ||α_j|| <=
||B_1|| ||B_N|| <= ||B_1||^2 < 1, so that ||d_j|| <= ||b_j||(1 + ||α_j||) < 2||b_j||, and

||d_j^{-1}|| <= ||(I - α_j)^{-1}|| ||b_j^{-1}|| <= ||b_j^{-1}|| / (1 - ||α_j||) <= ||b_j^{-1}|| / (1 - ||B_1||^2). QED
It is important to have bounds on cond(d_j) since systems involving d_j
must be solved during the factorization. Of course, a posteriori bounds on
||d_j^{-1}|| may be computed at a moderate cost. We note also that
cond(Δ) <= max_j 2 cond(b_j) / (1 - ||B_1||^2) when ||B_1|| < 1, and it is entirely
possible for the individual blocks b_j and d_j to be better-conditioned than A.
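Corollary 3.4 is easy to check numerically. The following sketch is illustrative only (the 2×2 blocks are arbitrarily chosen, and block norms are taken as ∞-norms); it runs the recurrence d_j = b_j - a_j d_{j-1}^{-1} c_{j-1} and compares cond(d_j) with the bound:

```python
import numpy as np

inf = lambda M: np.linalg.norm(M, np.inf)

# arbitrarily chosen constant 2x2 blocks: b_j = b, a_j = c_j = I
N = 10
b = np.array([[4.0, 1.0], [1.0, 4.0]])
a = c = np.eye(2)
binv = np.linalg.inv(b)

normB1 = inf(binv @ a) + inf(binv @ c)             # block-row sum for B[A]
assert normB1 < 1
bound = 2 * inf(b) * inf(binv) / (1 - normB1**2)   # Corollary 3.4

d = b.copy()
for j in range(1, N):
    d = b - a @ np.linalg.inv(d) @ c               # d_j = b_j - a_j d_{j-1}^{-1} c_{j-1}
    cond_d = inf(d) * inf(np.linalg.inv(d))
    assert cond_d < bound
```

Here normB1 = 2/3 and the bound evaluates to 6, while the observed cond(d_j) stays below 2.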
The inequalities in Theorem 3.2 are the best possible under the condition
||B_1|| < 1. This is because, for example, the first block rows of B_1 and B_2
are identical. However, it is possible to derive stronger results by using
different assumptions about B_1.
Theorem 3.4. [H4] For A = (a_j, b_j, c_j)_N, let λ(A) = max_j (4 ||b_j^{-1}a_j|| ||b_{j-1}^{-1}c_{j-1}||),
e_j = b_j^{-1}d_j. If λ(A) <= 1 then ||I - e_j|| <= (1 - √(1-λ))/2 and ||e_j^{-1}|| <= 2/(1 + √(1-λ)).

Some useful consequences of this theorem include better estimates for
||B_N|| and a smaller upper bound on the condition numbers of the d_j's. For we
have B_N = (0, 0, -d_j^{-1}c_j)_N = (0, 0, -e_j^{-1}b_j^{-1}c_j)_N, which implies that
||B_N|| <= 2||U[B_1]|| / (1 + √(1-λ)). As an example, if A = (1,3,1) then Theorem
3.2 predicts only ||B_N|| <= ||B_1|| = 2/3, while by Theorem 3.4 we obtain
||B_N|| <= 2(1/3) / (1 + √(1-4/9)) ≈ 0.382. Also, ||d_j^{-1}|| <= ||d_j^{-1}b_j|| ||b_j^{-1}|| =
||e_j^{-1}|| ||b_j^{-1}|| <= 2||b_j^{-1}|| / (1 + √(1-λ)), and ||d_j|| <= ||b_j - d_j|| + ||b_j|| <=
||b_j|| ||I - e_j|| + ||b_j|| <= 3||b_j||/2. It follows that cond(d_j) = ||d_j^{-1}|| ||d_j|| <=
3 cond(b_j). More precisely, cond(d_j) <= (1 + 2μ(λ)) cond(b_j), μ = (1 - √(1-λ))/(1 + √(1-λ)).
This is further proof of the stability of the block LU factorization.
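For the example A = (1,3,1) the bounds of Theorem 3.4 can be checked directly; this scalar sketch (our own, with an arbitrary length N = 50) runs the d_j recurrence and tests both bounds with λ = 4/9:

```python
import numpy as np

# scalar tridiagonal A = (1, 3, 1): lambda(A) = 4*(1/3)*(1/3) = 4/9
N = 50
lam = 4.0 / 9.0
d = np.empty(N)
d[0] = 3.0
for j in range(1, N):
    d[j] = 3.0 - 1.0 / d[j - 1]     # d_j = b_j - a_j d_{j-1}^{-1} c_{j-1}
e = d / 3.0                          # e_j = b_j^{-1} d_j

assert np.max(np.abs(1 - e)) <= (1 - np.sqrt(1 - lam)) / 2 + 1e-12
assert np.max(1 / np.abs(e)) <= 2 / (1 + np.sqrt(1 - lam)) + 1e-12
```

The d_j decrease monotonically toward (3 + √5)/2 ≈ 2.618, at which point both bounds are attained exactly.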
Theorem 3.4 has been used in the analysis of an iteration to compute the
block diagonal matrix (0, d_j, 0)_N [H4]. Such methods are potentially useful
in the treatment of time-dependent problems, where good initial approximations
are available. This particular iteration is designed for use on a parallel
computer and has several remarkable properties; for details, see [H4] and
Section 8.E.
Two results on the block LU factorization for block tridiagonal matrices
were previously obtained by Varah.

Theorem 3.5. [V1] If A is block diagonally dominant (||b_j^{-1}||(||a_j|| + ||c_j||) <= 1)
and ||c_j|| ≠ 0, then ||d_j^{-1}|| <= ||c_j||^{-1}, ||ℓ_j|| <= ||a_j|| / ||c_{j-1}||, and
||d_j|| <= ||b_j|| + ||a_j||.
In comparison to Theorem 3.2, the bound on ||d_j^{-1}|| corresponds to the
bound ||B_N|| <= ||B_1||, which follows from ||B_1|| < 1, and the bound on ||d_j||
corresponds to the bound ||A_{i+1}|| <= ||A_i||. In contrast to Theorem 3.2, ||B_1|| < 1
only allows us to conclude ||d_j^{-1}|| <= ||b_j^{-1}|| / (1 - ||B_1||^2), and we can only
estimate ||ℓ_j|| in the weak form ||ℓ_jc_{j-1}|| <= ||a_j|| ||d_{j-1}^{-1}c_{j-1}|| <= ||a_j||.
Varah's second result is

Theorem 3.6. [V1] Let α_j = (||b_j^{-1}a_j|| ||b_{j-1}^{-1}c_{j-1}||)^{1/2}, 2 <= j <= N, and sup-
pose α_j ≠ 0. Then the block LU factorization of A is numerically stable
(i.e., there exists a constant K(A) such that max(||ℓ_j||, ||d_j||) <= K(A)) if
(α_j, 1, α_{j+1})_N is positive semidefinite.

We note that (α_j, 1, α_{j+1})_N is a symmetrization of (||b_j^{-1}a_j||, 1, ||b_j^{-1}c_j||)_N
and that λ(A) <= 1 implies that it is positive definite. The actual bounds
on ||ℓ_j||, ||d_j|| given by Varah [V1] for this case will not be important for
this discussion.
Since many block tridiagonal systems arise from the discretization of
differential equations, a number of results have been obtained under assump-
tions closely related to the original continuous problem. Rosser [R7] shows
that if the system Ax = v, A = (a_j, b_j, c_j), satisfies a certain maximum
principle then A is nonsingular, the LU factorization exists (each d_j is non-
singular), B_N is non-negative and ||B_N|| <= 1. Improved bounds on B_N are
obtained when A is defined by the nine-point difference approximation to
Poisson's equation. In particular there is the remarkable observation that
the spectral condition number of d_j is less than 17/3. Such results depend
strongly on the particular matrix under consideration. For examples of
related work, see [B18], [Cl], [D2].
3.A.3. Back Substitution
So far we have only considered the elimination phase of the solution of
Ax = v. For block tridiagonal matrices A = (a_j, b_j, c_j)_N this consists of
the process

d_1 = b_1, f_1 = v_1,
d_j = b_j - a_jd_{j-1}^{-1}c_{j-1}, f_j = v_j - a_jd_{j-1}^{-1}f_{j-1}, j = 2,...,N.

The back substitution phase consists of the process

x_N = d_N^{-1}f_N,
x_j = d_j^{-1}(f_j - c_jx_{j+1}), j = N-1,...,1.
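The two phases transcribe directly into code. The sketch below is an illustrative numpy version (the helper name block_tridiag_solve is ours), checked against a dense solve:

```python
import numpy as np

def block_tridiag_solve(a, b, c, v):
    """Forward elimination d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},
    f_j = v_j - a_j d_{j-1}^{-1} f_{j-1}, then back substitution
    x_N = d_N^{-1} f_N, x_j = d_j^{-1}(f_j - c_j x_{j+1})."""
    N = len(b)
    d, f = [b[0]], [v[0]]
    for j in range(1, N):
        s = a[j] @ np.linalg.inv(d[j - 1])
        d.append(b[j] - s @ c[j - 1])
        f.append(v[j] - s @ f[j - 1])
    x = [None] * N
    x[-1] = np.linalg.solve(d[-1], f[-1])
    for j in range(N - 2, -1, -1):
        x[j] = np.linalg.solve(d[j], f[j] - c[j] @ x[j + 1])
    return np.concatenate(x)

# compare with a dense solve on a block diagonally dominant example
rng = np.random.default_rng(2)
N, n = 6, 2
b = [8 * np.eye(n) + rng.uniform(-1, 1, (n, n)) for _ in range(N)]
a = [None] + [rng.uniform(-1, 1, (n, n)) for _ in range(N - 1)]
c = [rng.uniform(-1, 1, (n, n)) for _ in range(N - 1)]
v = [rng.uniform(-1, 1, n) for _ in range(N)]
A = np.zeros((N * n, N * n))
for j in range(N):
    A[j*n:(j+1)*n, j*n:(j+1)*n] = b[j]
    if j > 0:
        A[j*n:(j+1)*n, (j-1)*n:j*n] = a[j]
    if j < N - 1:
        A[j*n:(j+1)*n, (j+1)*n:(j+2)*n] = c[j]
x = block_tridiag_solve(a, b, c, v)
assert np.allclose(A @ x, np.concatenate(v))
```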
In matrix terms these are the processes

A_1 = A, f^{(1)} = v,
A_{i+1} = (I + L[C_i]E_i)A_i, i = 1,...,N-1,
f^{(i+1)} = (I + L[C_i]E_i)f^{(i)}, i = 1,...,N-1,
g = D[A_N]^{-1}f^{(N)},
x^{(N)} = g, x^{(i)} = g + B_Nx^{(i+1)}, i = N-1,...,1,
x = x^{(1)}.

An alternate form of back substitution is

x^{(N)} = D[A_N]^{-1}f^{(N)},
x^{(i)} = (I + B_NE_{i+1})x^{(i+1)}, i = N-1,...,1,
x = x^{(1)}.
In either case it is seen that B_N plays an essential role in studying the
back-substitution process.

For those situations where a band elimination method for Ax = v might be
preferred, the matrix B_N maintains its importance for back substitution. We
have A = (a_j, b_j, c_j) = (a'_j, ℓ_j, 0)(0, u_j, c'_j), d_j = ℓ_ju_j, a_j = a'_ju_{j-1}, c_j = ℓ_jc'_j,
where ℓ_j (u_j) is lower (upper) triangular. The elimination phase generates
vectors f'_j = ℓ_j^{-1}f_j, and the back substitution computes x_N = u_N^{-1}f'_N = d_N^{-1}f_N,
x_j = u_j^{-1}(f'_j - c'_jx_{j+1}) = d_j^{-1}(f_j - c_jx_{j+1}). Except for the specific form of
roundoff errors the band and block back substitutions must behave identically.
Let us briefly consider the effect of rounding errors in back substitu-
tion; the computed solution will be y_j = x_j + ξ_j, and we require bounds on ξ_j.
Suppose the computed forward elimination yields d_j + δ'_j in place of d_j and
f_j + φ_j in place of f_j. Let η_j represent the errors introduced in forming
(f_j + φ_j) - c_jy_{j+1}, so that y_j satisfies (d_j + δ'_j + δ''_j)y_j = (f_j + φ_j) - c_jy_{j+1} + η_j
if ordinary Gaussian elimination is used to solve for y_j in
(d_j + δ'_j)y_j = (f_j + φ_j) - c_jy_{j+1} + η_j. It follows that
y_j = d_j^{-1}(f_j - c_jy_{j+1}) + e_j, where e_j = d_j^{-1}(φ_j + η_j - (δ'_j + δ''_j)y_j), which
implies ξ_j = e_j - d_j^{-1}c_jξ_{j+1}. Thus ||ξ_j|| <= ||B_N|| ||ξ_{j+1}|| + ||e_j||, and the
previous errors tend to be damped out in later computations if ||B_N|| < 1. If
we have a uniform upper bound for the e_j's, ||e_j|| <= ε, and ||B_N|| <= 1, then we
obtain ||ξ_j|| <= (N - j + 1)ε; if ||B_N|| < 1, then
||ξ_j|| <= (1 - ||B_N||^{N-j+1})ε / (1 - ||B_N||) < ε / (1 - ||B_N||).
3.A.4. Special Techniques
In some cases the matrix A = (a_j, b_j, c_j) possesses special properties
that may be important to preserve in order to satisfy a strong stability
criterion. For example, we might have b_j = t_j - a_j - c_j where ||t_j|| is small
or zero; this requires that all blocks be n × n. Scalar tridiagonal systems
with this property arise from the discretization of second order ODE boundary
value problems. Babuska [B1-3] has discussed a combined LU-UL factoriza-
tion for such a matrix which allows a very effective backward error analysis
to the ODE and a stable numerical solution of a difficult singular pertur-
bation problem posed by Dorr [D3]. (See [H7] for a symbolic approach to
this problem.)
There are two essential features to Babuska's method. The first is
to express b_j and d_j indirectly by starting with t_j and computing the se-
quence s_1 = t_1, s_j = t_j - a_jd_{j-1}^{-1}s_{j-1}; by this we have d_j = s_j - c_j. The
second feature is to augment the forward elimination with a backward elim-
ination and dispense with the back substitution by combining the results
of the two eliminations. Specifically, suppose A = (-p_j, t_j + p_j + q_j, -q_j)_N,
p_j, q_j, t_j >= 0, p_1 = q_N = 0, A nonsingular. The algorithm for Ax = v is
s_1 = t_1, f_1 = v_1,
s*_N = t_N, f*_N = v_N,

s_j = t_j + p_j(s_{j-1} + q_{j-1})^{-1}s_{j-1},
f_j = v_j + p_j(s_{j-1} + q_{j-1})^{-1}f_{j-1}, j = 2,...,N,

s*_j = t_j + q_j(s*_{j+1} + p_{j+1})^{-1}s*_{j+1},
f*_j = v_j + q_j(s*_{j+1} + p_{j+1})^{-1}f*_{j+1}, j = N-1,...,1,

x_j = (s_j + s*_j - t_j)^{-1}(f_j + f*_j - v_j), j = 1,...,N.

In matrix terms, we factor A = A_L + A_D + A_U = (I + L)(D + A_U) = (I + U)(D* + A_L)
and solve (I + L)f = v, (I + U)f* = v. Adding (D + A_U)x = f and (D* + A_L)x = f*
and subtracting Ax = v, we obtain (D + D* - A_D)x = f + f* - v.
Babuska's algorithm has some unusual round-off properties (cf. [M3]),
and underflow is a problem, though correctable by scaling. It handles small
row sums by taking account of dependent error distributions in the data,
and thus fits into Kahan's notion that "conserving confluence curbs ill-
condition" [K1].
To see how the method fits in with our results, we note that, for n = 1,
since p_j, q_j, t_j >= 0 we have s_j >= 0, s_j = t_j + p_jd_{j-1}^{-1}s_{j-1} >= t_j. Thus the row
sums can only increase during the factorization. On the other hand, they
will not increase much, since s_j = t_j + p_j(s_{j-1} + q_{j-1})^{-1}s_{j-1} <= t_j + p_j.
These are analogues of the results in Theorem 3.2.
3.B. Gauss-Jordan Elimination
An algorithm closely related to block elimination is the block Gauss-
Jordan elimination

A_1 = A,
A_{i+1} = (I + C[A_i]E_i)A_i, i = 1,...,N.

This avoids back substitution in the solution of Ax = v since A_{N+1} is block
diagonal. The rest of the Gauss-Jordan process is

f^{(1)} = v,
f^{(i+1)} = (I + C[A_i]E_i)f^{(i)}, i = 1,...,N,
x = A_{N+1}^{-1}f^{(N+1)}.
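For 1×1 blocks the process is easy to exercise; in this illustrative sketch (ours, not the thesis') the stage matrix I + C[A_i]E_i contributes the column of multipliers both above and below the diagonal:

```python
import numpy as np

def gauss_jordan_stage(A, f, i):
    # A_{i+1} = (I + C[A_i] E_i) A_i with 1x1 blocks: clear column i
    # above AND below the diagonal, leaving row i itself unchanged
    T = np.eye(len(A))
    T[:, i] = -A[:, i] / A[i, i]
    T[i, i] = 1.0
    return T @ A, T @ f

rng = np.random.default_rng(3)
N = 6
off = rng.uniform(0.0, 1.0, N - 1)
A = np.diag(rng.uniform(4.0, 5.0, N)) - np.diag(off, 1) - np.diag(off, -1)
f = rng.uniform(-1, 1, N)
A0, v = A.copy(), f.copy()
for i in range(N):
    A, f = gauss_jordan_stage(A, f, i)
x = f / np.diag(A)                      # A_{N+1} is diagonal
assert np.allclose(A0 @ x, v)
```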
Theorem 3.7. Let B_i = B[A_i]. If ||B_1|| < 1 then ||A_{i+1}|| <= ||A_i|| and
||B_{i+1}|| <= ||B_i|| for i = 1,...,N.
Sketch of Proof. The proof parallels that of Theorem 3.2. Note B_1 = B[A_1];
suppose ||B_i|| < 1. Let F_i = L[A_i] + U[A_i], so that A_{i+1} = (I - F_iE_iD[A_i]^{-1})A_i
= A_i - F_iE_i + F_iE_iB_i. This will show the first part. Now define
Q = D[A_i]^{-1}F_iE_iD[A_i]^{-1}A_i, so that A_{i+1} = A_i - D[A_i]Q, D[A_{i+1}] = D[A_i](I - S),
S = D[Q]. We have Q = (-B_iE_i)(I - B_i) = -B_iE_i + B_iE_iB_i, so S = D[B_iE_iB_i]
and ||S|| <= ||B_i||^2 < 1. Also, (I - S)B_{i+1} = B_i - S + Q. Let T = B_i - S + Q,
J = S + T, so that |J| = |S| + |T|, and Lemma 2.1 applies if ||J|| < 1. But
J = B_i(I - E_i + E_iB_i), so ||J|| <= ||B_i||. From Lemma 2.1 we conclude
||B_{i+1}|| <= ||J|| <= ||B_i||. QED
When applied to block tridiagonal matrices the block Gauss-Jordan algo-
rithm requires many more arithmetic operations than the block LU factoriza-
tion, and cannot be recommended on this basis. However, it has some numer-
ical properties that foreshadow our analysis of the more important odd-even
reduction algorithm.
The Gauss-Jordan elimination on block tridiagonals may be expressed in
the following way: at stage j (j = 1,...,N-1), block row j is used to eliminate
the subdiagonal block a_{j+1} and the fill-in blocks e_{m,j}, m < j, in column j.
We observe that, for 1 <= j < j+k <= N, -d_j^{-1}e_{j,j+k} = u_ju_{j+1}···u_{j+k-1},
u_j = -d_j^{-1}c_j, which is the (j, j+k) block of B_N^k. Thus we are able to refine Theorem 3.7
by giving closer bounds on the first i-1 block rows of B_i.
Corollary 3.5. If ||B_1|| < 1 and (point) row ℓ is contained in block row
j with 1 <= j <= i-1, then ρ_ℓ(B_i) <= ||B_1||^{i-j}.
If (point) row ℓ is contained in block row j with i <= j <= N, then
ρ_ℓ(B_i) = ρ_ℓ(B_1) <= ||B_1||.
This is our first result on the convergence to zero of portions of a
matrix during the reduction to a block diagonal matrix. It also provides
a rather different situation from Gauss-Jordan elimination with partial
pivoting applied to an arbitrary matrix [P1]. In that case the multipliers
above the main diagonal can be quite large, while we have seen for block
tridiagonal matrices with ||B_1|| < 1 that the multipliers actually decrease
in size.
3.C. Parallel Computation
We close this section with some remarks on the applications and inher-
ent parallelism of the methods; see also [H3]. When a small number of
parallel processors are available the block and band LU factorizations
are natural approaches. This is because the solution (or reduction to tri-
angular form) of the block system d_ju_j = -c_j is naturally expressed in
terms of vector operations on rows of the augmented matrix (d_j | -c_j), and
these vector operations can be executed efficiently in parallel. However,
we often cannot make efficient use of a pipeline processor since the vec-
tor lengths might not be long enough to overcome the vector startup costs.
Our comparison of algorithms for a pipeline processor (Section 10.C) will
therefore consider the LU factorization executed in scalar mode. We also
note that the final stage of Babuska's LU-UL factorization is the solution
of a block diagonal system, which is entirely parallel.
On the other hand, the storage problems created by fill-in with these
methods when the blocks are originally sparse are well-known. For large
systems derived from elliptic equations this is often enough to disqualify
general elimination methods as competitive algorithms. The block elimina-
tion methods will be useful when the block sizes are small enough so that
they can be represented directly as full matrices, or when a stable implicit
representation can be used instead.
4. QUADRATIC METHODS
In Section 3 we have outlined some properties of the block LU factor-
ization and block Gauss-Jordan elimination. The properties are character-
ized by the generation of a sequence of matrices M_i such that
||B[M_{i+1}]|| <= ||B[M_i]||. From this inequality we adopted the name "linear
methods". Now we consider some quadratic methods, which are characterized
by the inequality ||B[M_{i+1}]|| <= ||B[M_i]^2||.
There are two basic quadratic methods, odd-even elimination and odd-
even reduction. The latter usually goes under the name cyclic reduction,
and can be regarded as a compact version of the former. We will show that
odd-even elimination is a variation of the quadratic Newton iteration for
A^{-1}, while odd-even reduction is a special case of the more general cyclic
reduction technique of Hageman and Varga [HI] and is equivalent to block
elimination on a permuted system. Both methods are ideally suited for use
on a parallel computer, as many of the quantities involved may be computed
independently of the others [H3]. Semidirect methods based on the quadratic
properties are discussed in Section 7.
4.A. Odd-Even Elimination
We first consider odd-even elimination. Pick three consecutive block
equations from the system Ax = v, A = (a_j, b_j, c_j):

a_{k-1}x_{k-2} + b_{k-1}x_{k-1} + c_{k-1}x_k = v_{k-1},
a_kx_{k-1} + b_kx_k + c_kx_{k+1} = v_k,
a_{k+1}x_k + b_{k+1}x_{k+1} + c_{k+1}x_{k+2} = v_{k+1}.

If we multiply equation k-1 by -a_kb_{k-1}^{-1}, equation k+1 by -c_kb_{k+1}^{-1}, and add,
the result is
(4.2)    (-a_kb_{k-1}^{-1}a_{k-1})x_{k-2}
         + (b_k - a_kb_{k-1}^{-1}c_{k-1} - c_kb_{k+1}^{-1}a_{k+1})x_k
         + (-c_kb_{k+1}^{-1}c_{k+1})x_{k+2}
         = (v_k - a_kb_{k-1}^{-1}v_{k-1} - c_kb_{k+1}^{-1}v_{k+1}).
For k = 1 or N there are only two equations involved, and the necessary
modifications should be obvious. The name odd-even comes from the fact
that if k is even then the new equation has eliminated the odd unknowns
and left the even ones. This is not the only possible row operation to
achieve this effect, but others differ only by scaling and possible use of
a matrix commutativity condition.
By collecting the new equations into the block pentadiagonal system
H_2x = v^{(2)}, it is seen that the row eliminations have preserved the fact
that the matrix has only three non-zero diagonals, but they now are farther
apart. A similar set of operations may again be applied, combining equa-
tions k-2, k and k+2 of H_2x = v^{(2)} to produce a new equation involving the unknowns
x_{k-4}, x_k and x_{k+4}. These new equations form the system H_3x = v^{(3)}.
The transformations from one system to the next may be succinctly ex-
pressed in the following way. Let H_1 = A, v^{(1)} = v, m = ⌈log_2 N⌉, and define

H_{i+1} = (I + C[H_i])H_i,
v^{(i+1)} = (I + C[H_i])v^{(i)}, i = 1,2,...,m.

Block diagrams of the H sequence for N = 15 are shown in Figure 4.1a. The
distance between the three non-zero diagonals doubles at each stage until
the block diagonal matrix H_{m+1} is obtained; the solution x is then obtained
by x = H_{m+1}^{-1}v^{(m+1)}.
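The whole process can be sketched for 1×1 blocks as follows (an illustrative version of our own; odd_even_solve is our name, and m = ⌈log₂N⌉ stages are taken as above):

```python
import numpy as np

def odd_even_solve(A, v):
    """Solve A x = v by odd-even elimination:
    H_{i+1} = (I + C[H_i]) H_i = (2I - H_i D[H_i]^{-1}) H_i,
    v^{(i+1)} = (2I - H_i D[H_i]^{-1}) v^{(i)};
    after ceil(log2 N) stages H is diagonal (1x1 blocks, tridiagonal A)."""
    H = A.astype(float).copy()
    w = v.astype(float).copy()
    N = len(v)
    for _ in range(int(np.ceil(np.log2(N)))):
        T = 2 * np.eye(N) - H / np.diag(H)   # H D^{-1}: column j scaled by 1/d_j
        w = T @ w
        H = T @ H
    return w / np.diag(H)

N = 15
A = 4.0 * np.eye(N) - np.diag(np.ones(N - 1), 1) - np.diag(np.ones(N - 1), -1)
v = np.arange(1.0, N + 1)
x = odd_even_solve(A, v)
assert np.allclose(A @ x, v)
```

With N = 15 the distance between the non-zero diagonals doubles 1 → 2 → 4 → 8 → 16, at which point the matrix is diagonal.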
We note that the same elimination principle may be applied to matrices
of the periodic tridiagonal form. It is only necessary to include row
operations that "wrap around" the matrix, such as

row(N) - a_Nb_{N-1}^{-1} row(N-1) - c_Nb_1^{-1} row(1).

An H sequence for N = 16 is diagrammed in Figure 4.1b, and is constructed by
the same general rule as before, though there are serious and interesting
complications if N is not a power of 2.
Clearly, we can define an H sequence for any partitioned matrix A, and
only have to establish conditions under which the sequence may be continued.
In fact, the sequence ||B[H_i]|| is quadratic when ||B[A]|| < 1.
Theorem 4.1. If ||B[A]|| < 1 then ||B[H_{i+1}]|| <= ||B[H_i]^2|| and ||H_{i+1}|| <= ||H_i||.
If ||C[A]||_1 < 1 then ||C[H_{i+1}]||_1 <= ||C[H_i]^2||_1 and ||H_{i+1}||_1 <= ||H_i||_1.
Proof. We have H_1 = A, H_{i+1} = (I + C[H_i])H_i = (2I - H_iD[H_i]^{-1})H_i =
2H_i - H_iD[H_i]^{-1}H_i = H_i(2I - D[H_i]^{-1}H_i) = H_i(I + B[H_i]). Suppose ||B[H_i]|| < 1,
||C[H_i]||_1 < 1. These assumptions are independent of one another and will
be kept separate in the proof. Since H_i = D[H_i](I - B[H_i]) = (I - C[H_i])D[H_i],
we have H_{i+1} = D[H_i](I - B[H_i]^2) = (I - C[H_i]^2)D[H_i]. Setting S = D[B[H_i]^2],
T = D[C[H_i]^2], we obtain D[H_{i+1}] = D[H_i](I - S) = (I - T)D[H_i]. Since
||S|| <= ||B[H_i]^2|| < 1, ||T||_1 <= ||C[H_i]^2||_1 < 1, and D[H_i] is invertible,
it follows that D[H_{i+1}] is invertible. We have B[H_{i+1}] = I - D[H_{i+1}]^{-1}H_{i+1} =
I - (I - S)^{-1}D[H_i]^{-1}D[H_i](I - B[H_i]^2), so B[H_i]^2 = (I - S)B[H_{i+1}] + S;
similarly C[H_i]^2 = C[H_{i+1}](I - T) + T. Now, S is block diagonal and the
block diagonal part of B[H_{i+1}] is null, so |B[H_i]^2| = |(I - S)B[H_{i+1}]| + |S|.
Lemma 2.2 now applies, and we conclude ||B[H_{i+1}]|| <= ||B[H_i]^2||. Lemma 2.4
implies that ||C[H_{i+1}]||_1 <= ||C[H_i]^2||_1. To show ||H_{i+1}|| <= ||H_i|| if
||B[H_i]|| < 1, write H_{i+1} = D[H_i] + (H_i - D[H_i])B[H_i]. We then have, for 1 <= ℓ <= Σ_{j=1}^{N} n_j,

ρ_ℓ(H_{i+1}) <= ρ_ℓ(D[H_i]) + ρ_ℓ(H_i - D[H_i]) ||B[H_i]||
            <= ρ_ℓ(D[H_i]) + ρ_ℓ(H_i - D[H_i])
            = ρ_ℓ(H_i).

To show ||H_{i+1}||_1 <= ||H_i||_1 if ||C[H_i]||_1 < 1, write H_{i+1} = D[H_i] + C[H_i](H_i - D[H_i]).
We have

ν_ℓ(H_{i+1}) <= ν_ℓ(D[H_i]) + ||C[H_i]||_1 ν_ℓ(H_i - D[H_i])
            <= ν_ℓ(D[H_i]) + ν_ℓ(H_i - D[H_i])
            = ν_ℓ(H_i). QED
For a block tridiagonal matrix A we have B[H_{m+1}] = 0 since H_{m+1} is
block diagonal; thus the process stops at this point. The proof of Theorem
4.1 implicitly uses the convexity-like formulae

H_{i+1} = H_iB[H_i] + D[H_i](I - B[H_i])
        = C[H_i]H_i + (I - C[H_i])D[H_i].
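Theorem 4.1's quadratic inequality can be checked numerically in the scalar-block case, where ||·|| reduces to the ∞-norm (a small sketch of ours, not from the thesis):

```python
import numpy as np

B = lambda H: np.eye(len(H)) - H / np.diag(H)[:, None]   # B[H] = I - D[H]^{-1} H
norm = lambda M: np.linalg.norm(M, np.inf)

N = 9
H = 3.0 * np.eye(N) - np.diag(np.ones(N - 1), 1) - np.diag(np.ones(N - 1), -1)
assert norm(B(H)) < 1
for _ in range(3):
    Hn = (2 * np.eye(N) - H / np.diag(H)) @ H        # H_{i+1} = (I + C[H_i]) H_i
    assert norm(B(Hn)) <= norm(B(H) @ B(H)) + 1e-12  # ||B[H_{i+1}]|| <= ||B[H_i]^2||
    assert norm(Hn) <= norm(H) + 1e-12               # ||H_{i+1}|| <= ||H_i||
    H = Hn
```

Starting from ||B[H_1]|| = 2/3, three stages already bring ||B[H_i]|| below 0.1.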
As with Theorems 3.2 and 3.3, it is not difficult to find counterexamples
to show that Theorem 4.1 does not hold for arbitrary matrix norms. In
particular, it is not true that ||B[A]||_2 < 1 implies ||B[H_2]||_2 <= ||B[H_1]||_2,
or even that ||B[H_2]||_2 < 1.
The conclusions of Theorem 4.1 are that the matrix elements, considered
as a whole rather than individually, do not increase in size, and that the
off-diagonal blocks decrease in size relative to the diagonal blocks if
||B[A]|| < 1. The rate of decrease is quadratic, though if ||B[A]|| is very
close to 1 the actual decreases may be quite small initially. The relative
measure quantifies the diagonal dominance of H_i, and shows that the diagonal
elements will always be relatively larger than the off-diagonals. One con-
clusion that is not reached is that they will be large on an absolute scale.
Indeed, consider the matrix A = (-1, 2+ε, -1), for which the non-zero diagonals
of H_i are

(-1, 2+ε_i, -1) / Π_{k=1}^{i-1}(2+ε_k), ε_1 = ε, ε_{k+1} = 4ε_k + ε_k^2.

For small ε the off-diagonal elements will be about 2^{1-i} and the main diagonal
elements about 2^{2-i} in magnitude. While not a serious decrease in size,
periodic rescaling of the main diagonal to 1 would perhaps be desirable.
As already hinted in the above remarks, a more refined estimate of
||B[H_{i+1}]|| may be made when dealing with tridiagonal matrices with constant
diagonals [S2]. Let A = (a,b,a); then H_2 has main diagonal entries b - 2a^2/b
(b - a^2/b in the first and last positions) and off-diagonal entries -a^2/b at
distance two. It follows that when 2|a| < |b|,

||B[H_2]|| = 2|a^2/b| / |b - 2a^2/b| = ||B[H_1]^2|| / (2 - ||B[H_1]^2||),

and thus ||B[H_1]^2||/2 < ||B[H_2]|| < ||B[H_1]^2||. Note that H_2
does not have constant diagonals, nor will any of the other H matrices,
but most of the rows of H_i are essentially identical, and it is these rows
which determine ||B[H_i]||. This property will be important for the odd-even
reduction algorithms.
When A = (a_j, b_j, c_j) is block diagonally dominant, we can write
H_i = (a_j^{(i)}, 0,...,0, b_j^{(i)}, 0,...,0, c_j^{(i)}), β_j^{(i)} = ||(b_j^{(i)})^{-1}||(||a_j^{(i)}|| + ||c_j^{(i)}||),
β^{(i)} = max_j β_j^{(i)}. Without proving it here, we can show that if β^{(1)} < 1 then
β^{(i+1)} <= (β^{(i)})^2. An irreducibility condition plus β^{(1)} <= 1 yields the same
conclusion.

We also note that if A is symmetric then so is H_i, i >= 1, and thus
C[H_i] = B[H_i]^T.
Theorem 4.2. If A is block tridiagonal and positive definite then so is H_i,
i >= 1, and ρ(B[H_i]) < 1.

Proof. First we observe that each H_i is a 2-cyclic matrix [V3] since it can
be permuted into block tridiagonal form. Suppose that H_i is positive definite.
Since D_i = D[H_i] is also positive definite, there exists a positive definite
matrix E_i such that E_i^2 = D_i. Let Ĥ_i = E_i^{-1}H_iE_i^{-1}, Ĥ_{i+1} = E_i^{-1}H_{i+1}E_i^{-1}, so Ĥ_{i+1} = 2Ĥ_i - Ĥ_i^2;
H_i and H_{i+1} are positive definite iff Ĥ_i and Ĥ_{i+1} are positive definite.
Since Ĥ_i is a positive definite 2-cyclic matrix, we know from Corollary 2
of Theorem 3.6 and Corollary 1 of Theorem 4.3 of [V3] that ρ(I - Ĥ_i) < 1.
Since B_i = I - D_i^{-1}H_i = E_i^{-1}(I - Ĥ_i)E_i, it follows that ρ(B_i) < 1 also. Let λ be
an eigenvalue of Ĥ_i. Then -1 < 1-λ < 1 since ρ(I - Ĥ_i) < 1, and so we have
0 < λ < 2 and 0 < 2λ - λ^2 <= 1. But 2λ - λ^2 is an eigenvalue of Ĥ_{i+1} = 2Ĥ_i - Ĥ_i^2,
which is therefore positive definite. QED
Any quadratic convergence result such as Theorem 4.1 calls to mind the
Newton iteration. If N(Y,Z) = Y(2I - ZY) = (2I - YZ)Y, Y^{(0)} is any matrix
with ||I - ZY^{(0)}||_a < 1 for some norm ||·||_a, and Y^{(i+1)} = N(Y^{(i)}, Z), then
Y^{(i)} converges to Z^{-1} if it exists, and I - ZY^{(i+1)} = (I - ZY^{(i)})^2 [H8].
It may be observed that H_{i+1} = N(H_i, D[H_i]^{-1}), so while the H sequence is not
a Newton sequence there is a close relation between the two. Nor is it
entirely unexpected that Newton's method should appear; cf. Yun [Y1] and
Kung [K2] for other occurrences of Newton's method in algebraic computations.
Kung and Traub [K3] give a powerful unification of such techniques.
The Newton iteration Y^{(i+1)} = N(Y^{(i)}, Z) is effective because, in a
certain sense, it takes aim on the matrix Z^{-1} and always proceeds towards it.
We have seen that odd-even elimination applied to a block tridiagonal matrix
eventually yields a block diagonal matrix. The general iteration
H_{i+1} = N(H_i, D[H_i]^{-1}) incorporates this idea by aiming the i-th step towards
the block diagonal matrix D[H_i]. Changing the goal of the iteration at each
step might be thought to destroy quadratic convergence, but we have seen
that it still holds in the sequence ||B[H_i]||.
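The iteration N(Y,Z) = Y(2I - ZY) itself is easy to demonstrate. In this sketch (ours; the starting guess Y^{(0)} = Z^T/(||Z||_1 ||Z||_∞) is a standard choice guaranteeing ||I - ZY^{(0)}||_2 < 1, and is not taken from the thesis):

```python
import numpy as np

def newton_inverse(Z, steps=12):
    """Quadratic Newton iteration for Z^{-1}: Y_{i+1} = Y_i (2I - Z Y_i),
    so that I - Z Y_{i+1} = (I - Z Y_i)^2."""
    n = Z.shape[0]
    Y = Z.T / (np.linalg.norm(Z, 1) * np.linalg.norm(Z, np.inf))
    for _ in range(steps):
        Y = Y @ (2 * np.eye(n) - Z @ Y)
    return Y

rng = np.random.default_rng(4)
n = 5
Z = n * np.eye(n) + rng.uniform(-1, 1, (n, n))
Y = newton_inverse(Z)
assert np.allclose(Y @ Z, np.eye(n))
```

Each step squares the residual I - ZY, the same quadratic behavior exhibited by the sequence ||B[H_i]|| above.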
4.B. Variants of Odd-Even Elimination
There are a number of variations of odd-even elimination, mostly dis-
cussed originally in terms of the odd-even reduction. For constant block
diagonals A = (a,b,a) with ab = ba, Hockney [H5] multiplies equation k-1
by -a, equation k by b, equation k+1 by -a, and adds to obtain

(-a^2)x_{k-2} + (b^2 - 2a^2)x_k + (-a^2)x_{k+2} = bv_k - av_{k-1} - av_{k+1}.

This formulation is prone to numerical instability when applied to the dis-
crete form of the Poisson equation, and a stabilization has been given by
Buneman (see [B20] and the discussion below).
With Hockney's use of Fourier analysis the block tridiagonal system
Ax = v is transformed into a collection of tridiagonal systems of the form
(-1, λ, -1)(s_j) = (t_j), where λ >= 2 and often λ >> 2. It was Hockney's obser-
vation that the reduction, which generates a sequence of Toeplitz tridiagonal
matrices (-1, λ_i, -1), could be stopped when 1/λ_i fell below the machine pre-
cision. Then the tridiagonal system was essentially diagonal and could be
solved as such without damage to the solution of the Poisson equation.
Stone [S2], in recommending a variation of cyclic reduction for the
parallel solution of general tridiagonal systems, proved quadratic conver-
gence of the ratios 1/λ_i and thus of the B matrices for the constant
diagonal case. Jordan [J1], independently of our own work [H2], also extend-
ed Stone's result to general tridiagonals and the odd-even elimination algo-
rithm. The results given here apply to general block systems, and are not
restricted solely to the block tridiagonal case, though this will be the
major application.
We now show that quadratic convergence results hold for other formu-
lations of the odd-even elimination; we will see later how these results
carry over to odd-even reduction. In order to subsume several algorithms
into one, consider the sequences H_1 = A, v^{(1)} = v,

H_{i+1} = F_i(2H_i - H_iG_iH_i),
v^{(i+1)} = F_i(2I - H_iG_i)v^{(i)},

where F_i, G_i are nonsingular block diagonal matrices. The H sequence cor-
responds to the choices F_i = I, G_i = D[H_i]^{-1}.
In the Hockney-Buneman algorithm for Poisson's equation with Dirichlet
boundary conditions on a rectangle we have A = (a,b,a), ab = ba,
H_i = (a^{(i)}, b^{(i)}, a^{(i)}), and

F_i = diag(b^{(i)}),  G_i = D[H_i]^{-1},
a^{(1)} = a,  a^{(i+1)} = -a^{(i)2},
b^{(1)} = b,  b^{(i+1)} = b^{(i)2} - 2a^{(i)2}.
Sweet [S11] also uses these choices to develop an odd-even reduction algo-
rithm. We again note that the matrices lose the constant diagonal prop-
erty for i ≥ 2, but symmetry is preserved. Each block of H_i may be repre-
sented as a bivariate rational function of a and b; the scaling by F_i en-
sures that the blocks of block rows j·2^i of H_{i+1} are actually polynomials in
a and b. In fact, the diagonal blocks of these rows will be b^{(i+1)} and the
non-zero off-diagonal blocks will be a^{(i+1)}.
Swarztrauber [S6] describes a Buneman-style algorithm for separable
elliptic equations on a rectangle. This leads to A = (a_j, b_j, c_j) where the
blocks are all n × n, and a_j = α_j I_n, b_j = T + β_j I_n, c_j = γ_j I_n. In this
case the blocks commute with each other and may be expressed as polynomials
in the tridiagonal matrix T. Swarztrauber's odd-even elimination essenti-
ally uses F_i = diag(LCM(b_{j-2^{i-1}}^{(i)}, b_{j+2^{i-1}}^{(i)})),★ with b_j^{(i)} = I_n for j ≤ 0 or
j ≥ N + 1, and G_i = D[H_i]^{-1}. Stone [S2] considers a similar algorithm for tri-
diagonal matrices (n = 1) but with F_i = diag(b_{j-2^{i-1}}^{(i)} b_{j+2^{i-1}}^{(i)}), G_i = D[H_i]^{-1}.
Note that symmetry is not preserved.
In all three cases it may be shown by induction that H_{i+1} = F_i···F_1 Ĥ_{i+1}
and v^{(i+1)} = F_i···F_1 v̂^{(i+1)}, where Ĥ_i, v̂^{(i)} denote the sequences generated
with F_i = I. These expressions signal the instability of
the basic method when F_i ≠ I [B20]. However, in Section 2.A we showed that
the B matrices are invariant under premultiplication by nonsingular block diagonal
matrices, so that B[H_i] = B[Ĥ_i].
The Buneman modifications for stability represent the v^{(i)} sequence by
sequences p^{(i)}, q^{(i)} with v^{(i)} = D_i p^{(i)} + q^{(i)}. Defining D_i = D[H_i],
Q_i = D_i - H_i, S_i = D[Q_i G_i Q_i], and assuming D_i G_i Q_i = Q_i G_i D_i, we have

p^{(1)} = 0,  q^{(1)} = v,
p^{(i+1)} = p^{(i)} + D_i^{-1}(Q_i p^{(i)} + q^{(i)}),
q^{(i+1)} = F_i(Q_i G_i q^{(i)} + S_i p^{(i+1)}).
Theorem 4.3. Let P_1 = I, P_{i+1} = B[H_i]P_i, i ≥ 1. Then

p^{(i)} = (I - P_i)x,  q^{(i)} = (D_i P_i - Q_i)x.
Remark. Although not expressed in this form, Buzbee et al. [B20] prove spe-
cial cases of these formulae for the odd-even reduction applied to Poisson's
equation.
Proof. To begin the induction, p^{(1)} = 0 = (I - P_1)x, q^{(1)} = v = H_1 x
= (D_1 P_1 - Q_1)x. Suppose p^{(i)} = (I - P_i)x, q^{(i)} = (D_i P_i - Q_i)x. Then
★
LCM denotes a least common multiple of the matrices, considered as polynomials in T.
p^{(i+1)} = (I - P_i)x + D_i^{-1}(Q_i(I - P_i)x + (D_i P_i - Q_i)x)
= (I - P_i + D_i^{-1}Q_i - D_i^{-1}Q_i P_i + D_i^{-1}D_i P_i - D_i^{-1}Q_i)x
= (I - B[H_i]P_i)x
= (I - P_{i+1})x,

q^{(i+1)} = v^{(i+1)} - D_{i+1}p^{(i+1)}
= H_{i+1}x - D_{i+1}(I - P_{i+1})x
= (D_{i+1} - Q_{i+1} - D_{i+1} + D_{i+1}P_{i+1})x
= (D_{i+1}P_{i+1} - Q_{i+1})x.  QED.
We note that when G_i = D[H_i]^{-1} and β = ||B[A]|| < 1, then ||P_i|| ≤ β^{2^{i-1}-1}
and ρ_k(D_i P_i - Q_i) ≤ ρ_k(D_i)||P_i|| + ρ_k(Q_i) ≤ ρ_k(D_i) + ρ_k(Q_i) = ρ_k(H_i), so
||D_i P_i - Q_i|| ≤ ||H_i||. It also follows that ||x - p^{(i)}|| ≤ ||P_i|| ||x|| and
||q^{(i)} + Q_i x|| ≤ ||D_i|| ||P_i|| ||x||. Thus p^{(i)} will be an approximation to x;
Buzbee [B17] has taken advantage of this fact in a semidirect method for
Poisson's equation.
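The p, q recursions can be exercised numerically. The harness below is our own toy (it fixes the simplest choices F_i = I, G_i = D[H_i]^{-1} and 1 × 1 blocks; the variable names are ours); it checks the invariant v^{(i)} = D_i p^{(i)} + q^{(i)} at every step and the convergence of p^{(i)} to x:

```python
import numpy as np

N = 15
A = np.diag(4.0 * np.ones(N)) + np.diag(-np.ones(N - 1), 1) + np.diag(-np.ones(N - 1), -1)
x = np.sin(np.arange(1, N + 1))
v = A @ x

H = A.copy()
p = np.zeros(N)
q = v.copy()
for i in range(8):
    d = np.diag(H)
    D = np.diag(d)
    Dinv = np.diag(1.0 / d)
    Q = D - H                               # Q_i = D_i - H_i
    S = np.diag(np.diag(Q @ Dinv @ Q))      # S_i = D[Q_i G_i Q_i] with G_i = D_i^{-1}
    p = p + Dinv @ (Q @ p + q)              # p^(i+1)
    q = Q @ Dinv @ q + S @ p                # q^(i+1), using the new p
    H = (2.0 * D - H) @ Dinv @ H            # plain odd-even elimination step
    assert np.allclose(H @ x, np.diag(np.diag(H)) @ p + q)   # v^(i+1) = D p + q
assert np.allclose(p, x)                    # p^(i) -> x as H_i -> D[H_i]
```

As H_i approaches block diagonal form, Q_i → 0 and p^{(i)} stabilizes at x, which is the behavior the semidirect methods exploit.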
4.C. Odd-Even Reduction
4.C.1. The Basic Algorithm
The groundwork for the odd-even reduction algorithms has now been laid.
Returning to the original derivation of odd-even elimination in (4.1) and
(4.2), suppose we collect the even-numbered equations of (4.2) into a
linear system A^{(2)}x^{(2)} = w^{(2)}, where

A^{(2)} = (-a_{2j}b_{2j-1}^{-1}a_{2j-1},  b_{2j} - a_{2j}b_{2j-1}^{-1}c_{2j-1} - c_{2j}b_{2j+1}^{-1}a_{2j+1},  -c_{2j}b_{2j+1}^{-1}c_{2j+1})_{N_2},
x^{(2)} = (x_{2j})_{N_2},
w^{(2)} = (v_{2j} - a_{2j}b_{2j-1}^{-1}v_{2j-1} - c_{2j}b_{2j+1}^{-1}v_{2j+1})_{N_2},
N_2 = ⌊N/2⌋.
(2)
This is the reduction step, and it is clearly a reduction since A is half
(1) (2) (2) (2) (2)
the size of A =* A. Once A x = w has been solved for x we can
compute the remaining components of x^ * x by the back substitution
X2j-1 = b2j-l(v2j-l ' a2j-lx2j-2 " C2j-lX2j)’ j = N/2'’
Since A^{(2)} is block tridiagonal we can apply this procedure recursively
to obtain a sequence of systems A^{(i)}x^{(i)} = w^{(i)}, where A^{(1)} = A, w^{(1)} = v,
x^{(i)} = (x_{j·2^{i-1}})_{N_i}, N_1 = N, N_{i+1} = ⌊N_i/2⌋. We write A^{(i)} = (a_j^{(i)}, b_j^{(i)}, c_j^{(i)}),
x_j^{(i)} = x_{j·2^{i-1}}. The reduction stops with A^{(m)}x^{(m)} = w^{(m)}, m = ⌈log_2(N+1)⌉
(note the change in notation), since A^{(m)} consists of a single block. At
this point the back substitution begins, culminating in x^{(1)} = x.
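For 1 × 1 blocks the whole recursive process fits in a few lines. The following sketch is our own (the function name and array conventions are ours), checked against a dense solve:

```python
import numpy as np

def oe_solve(a, b, c, v):
    """Solve tridiag(a, b, c) x = v by odd-even reduction (scalar blocks).
    a[k] multiplies x[k-1] and c[k] multiplies x[k+1]; a[0], c[-1] are unused."""
    N = len(b)
    if N == 1:
        return v / b
    # pad both ends with a harmless equation 1 * x_pad = 0
    a = np.concatenate(([0.0], a, [0.0]))
    b = np.concatenate(([1.0], b, [1.0]))
    c = np.concatenate(([0.0], c, [0.0]))
    v = np.concatenate(([0.0], v, [0.0]))
    at, ct, vt = a / b, c / b, v / b
    e = np.arange(2, N + 1, 2)            # even 1-based unknown indices
    a2 = -a[e] * at[e - 1]
    b2 = b[e] - a[e] * ct[e - 1] - c[e] * at[e + 1]
    c2 = -c[e] * ct[e + 1]
    w2 = v[e] - a[e] * vt[e - 1] - c[e] * vt[e + 1]
    x = np.zeros(N + 2)
    x[e] = oe_solve(a2, b2, c2, w2)       # recurse on the half-size system
    o = np.arange(1, N + 1, 2)            # back substitution for odd unknowns
    x[o] = vt[o] - at[o] * x[o - 1] - ct[o] * x[o + 1]
    return x[1:N + 1]

rng = np.random.default_rng(0)
N = 13
b = 4.0 + rng.random(N)
a = -rng.random(N); c = -rng.random(N)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x_true = rng.random(N)
x = oe_solve(a, b, c, A @ x_true)
assert np.allclose(x, x_true)
```

Note that no special choice of N is required here; padding with trivial equations plays the role of the "strategically stored zeros" discussed later.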
Before proceeding, we note that if N = 2^m - 1 then N_i = 2^{m+1-i} - 1. The
odd-even reduction algorithms have usually been restricted to this particular
choice of N, but in the general case this is not necessary. However, for
A = (a,b,a) there are a number of important simplifications that can be
made if N = 2^m - 1 [B20]. Typically these involve the special scalings
mentioned earlier, which effectively avoid the inversion of the diagonal blocks
during the reduction.
It is also not necessary to continue the reduction recursively if
A^{(k)}x^{(k)} = w^{(k)} can be solved more efficiently by some other method. This
leads to various polyalgorithms and the semidirect methods of Section 7.
By the way in which A^{(i)} has been constructed, it is clear that it
consists of blocks taken from H_i. For example, in Figure 4.1a, A^{(4)} is
the (8,8) block of H_4. It follows immediately that ||A^{(i)}|| ≤ ||H_i||,
||B[A^{(i)}]|| ≤ ||B[H_i]||, and ||C[A^{(i)}]|| ≤ ||C[H_i]||. As a corollary
of Theorem 4.1 we actually have

Corollary 4.1. If ||B[A]|| < 1 then ||B[A^{(i+1)}]|| ≤ ||B[A^{(i)}]^2|| and
||A^{(i+1)}|| ≤ ||A^{(i)}||. If ||C[A]||_1 < 1 then ||C[A^{(i+1)}]||_1 ≤ ||C[A^{(i)}]^2||_1
and ||A^{(i+1)}||_1 ≤ ||A^{(i)}||_1.
Recalling the definition of Theorem 3.4, let λ(A) = max_j 4||b_j^{-1}a_j|| ||b_{j-1}^{-1}c_{j-1}||.

Theorem 4.4. If λ(A) ≤ 1 then, with λ^{(i)} = λ(A^{(i)}), we have

λ^{(i+1)} ≤ (λ^{(i)}/(2 - λ^{(i)}))^2.
Proof. It will suffice to consider i = 1. We have

ω_{2j} = b_{2j}^{-1}a_{2j}b_{2j-1}^{-1}c_{2j-1} + b_{2j}^{-1}c_{2j}b_{2j+1}^{-1}a_{2j+1},
b_j^{(2)} = b_{2j}(I - ω_{2j}),
a_j^{(2)} = -a_{2j}b_{2j-1}^{-1}a_{2j-1},
c_j^{(2)} = -c_{2j}b_{2j+1}^{-1}c_{2j+1},
||ω_{2j}|| ≤ ||b_{2j}^{-1}a_{2j}|| ||b_{2j-1}^{-1}c_{2j-1}|| + ||b_{2j}^{-1}c_{2j}|| ||b_{2j+1}^{-1}a_{2j+1}|| ≤ 2(λ/4),
||(I - ω_{2j})^{-1}|| ≤ (1 - λ/2)^{-1} = 2/(2 - λ),

so that

4||b_j^{(2)-1}a_j^{(2)}|| ||b_{j-1}^{(2)-1}c_{j-1}^{(2)}||
≤ 4||(I - ω_{2j})^{-1}|| ||(I - ω_{2j-2})^{-1}|| (||b_{2j}^{-1}a_{2j}|| ||b_{2j-1}^{-1}c_{2j-1}||)(||b_{2j-1}^{-1}a_{2j-1}|| ||b_{2j-2}^{-1}c_{2j-2}||)
≤ 4(2/(2 - λ))^2 (λ/4)^2 = (λ/(2 - λ))^2.

Thus λ^{(2)} ≤ (λ^{(1)}/(2 - λ^{(1)}))^2.  QED.
This may be compared to earlier results for constant diagonal matrices,
which give an equality of the form β^{(2)} = β^2/(2 - β^2). See also Theorem
4.8 of [V3].
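The bound of Theorem 4.4 is easy to observe numerically. The harness below is ours (scalar blocks, random diagonally dominant data; the function names are ours); it computes λ before and after one reduction step:

```python
import numpy as np

def lam(a, b, c):
    """lambda(A) = max_j 4 |a_j/b_j| |c_{j-1}/b_{j-1}| for scalar blocks."""
    r = np.abs(a[1:] / b[1:]) * np.abs(c[:-1] / b[:-1])
    return 4.0 * r.max()

def reduce_once(a, b, c):
    """One odd-even reduction step; a[0] and c[-1] are treated as zero."""
    N = len(b)
    a = np.concatenate(([0.0], a, [0.0])); a[1] = 0.0
    b = np.concatenate(([1.0], b, [1.0]))
    c = np.concatenate(([0.0], c, [0.0])); c[N] = 0.0
    e = np.arange(2, N + 1, 2)
    a2 = -a[e] * a[e - 1] / b[e - 1]
    b2 = b[e] - a[e] * c[e - 1] / b[e - 1] - c[e] * a[e + 1] / b[e + 1]
    c2 = -c[e] * c[e + 1] / b[e + 1]
    return a2, b2, c2

rng = np.random.default_rng(1)
N = 31
b = np.full(N, 2.2)
a = -rng.random(N); c = -rng.random(N)
l1 = lam(a, b, c)
a2, b2, c2 = reduce_once(a, b, c)
l2 = lam(a2, b2, c2)
assert l1 <= 1.0                               # hypothesis of Theorem 4.4
assert l2 <= (l1 / (2.0 - l1)) ** 2 + 1e-12    # the claimed quadratic ratio bound
```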
4.C.2. Relation to Block Elimination
By regarding odd-even reduction as a compact form of odd-even elimina-
tion we are able to relate it to the Newton iteration for A^{-1}. It is also
possible to express the reduction algorithm as block elimination on a per-
muted system (PAP^T)(Px) = (Pv) (cf. [S11], [E3], [W2]). In Section 5 we
show how this formulation also yields quadratic convergence theorems.
Specifically, let P be the permutation matrix which reorders the vector
(1,2,3,...,N)^T so that the odd multiples of 2^0 come first, followed by the
odd multiples of 2^1, the odd multiples of 2^2, etc. For N = 7,
P(1,2,...,7)^T = (1,3,5,7,2,6,4)^T.
This factorization gives rise to a related method, first solving L'z = Pv,
then U'(Px) = z; the components of z are the right-hand sides w^{(i)} generated
during the reduction.
We may now apply any of the theorems about Gaussian or block elimina-
tion to PAP^T and obtain corresponding results about odd-even reduction and
the above modification. In particular, Theorem 3.2 applies though its
conclusions are weaker than those of Corollary 4.1. It follows from the
PAP^T = LL^T factorization as well as Theorem 4.2 that A^{(i)} will be positive
definite if A is; Theorem 4.2 also allows the conclusion that ρ(B[A^{(i)}]) < 1.
The odd-even reduction is also a special case of the more general
cyclic reduction defined by Hageman and Varga [H1] for the iterative solu-
tion of large sparse systems. Since PAP^T is in the 2-cyclic form

PAP^T = [ D_1    A_12 ]
        [ A_21   D_2  ],

with D_1 and D_2 block diagonal, we can premultiply by

[ I               -A_12 D_2^{-1} ]
[ -A_21 D_1^{-1}   I             ]

to obtain

[ D_1 - A_12 D_2^{-1} A_21   0                         ]
[ 0                          D_2 - A_21 D_1^{-1} A_12  ].

It is clear that A^{(2)} = D_2 - A_21 D_1^{-1} A_12, and (D_1 - A_12 D_2^{-1} A_21, 0)
are the rows of H_2 that have not been computed as part of the reduction.
In consequence of this formulation, if A is an M-matrix then so is A^{(i)}, and
ρ(B[A^{(i+1)}]) ≤ ρ^2(B[A^{(i)}]) < 1 [H1].
The block LU factorization PAP^T = LΛU, where Λ is block diagonal and
L (U) is unit lower (upper) triangular, determines the reduction and back
substitution phases. Norm bounds on L and U were given in Section 3.A.2
for general matrices with ||B[A]|| < 1. For block tridiagonal matrices in
the natural ordering we have ||L||_1 ≤ 1 + ||C[A]||_1, ||L|| ≤ 1 + n||C[A]||_1 if
||C[A]||_1 < 1, ||U|| ≤ 1 + ||B[A]||, ||U||_1 ≤ 1 + n||B[A]|| if ||B[A]|| < 1,
n = max_j n_j. For the permuted block tridiagonal system only the bounds on
||L|| and ||U||_1 will grow, and then only logarithmically. By the construc-
tion of L and U,

||L||_1 ≤ 1 + max_{1≤i≤m} ||C[A^{(i)}]||_1,
||L|| ≤ 1 + Σ_{i=1}^m ||C[A^{(i)}]||,
||U||_1 ≤ 1 + Σ_{i=1}^m ||B[A^{(i)}]||_1,
||U|| ≤ 1 + max_{1≤i≤m} ||B[A^{(i)}]||.

Since C[A^{(i)}] is block tridiagonal, ||C[A^{(i)}]|| ≤ 2n||C[A^{(i)}]||_1; similarly
||B[A^{(i)}]||_1 ≤ 2n||B[A^{(i)}]||. If ||B[A]|| < 1, ||C[A]||_1 < 1, we then have

||L||_1 ≤ 1 + ||C[A]||_1,
||L|| ≤ 1 + 2nm||C[A]||_1,
||U||_1 ≤ 1 + 2nm||B[A]||,
||U|| ≤ 1 + ||B[A]||.
Much like the proof of Corollary 3.4 we also have, if ||B[A]|| < 1,

b_j^{(i+1)} = b_{2j}^{(i)}(I - ω_{2j}^{(i)}),
||b_j^{(i+1)}|| ≤ ||b_{2j}^{(i)}||(1 + β_i),  β_i = ||B[A^{(i)}]^2||,
||(b_j^{(i+1)})^{-1}|| ≤ ||(b_{2j}^{(i)})^{-1}||/(1 - β_i),
cond(b_j^{(i+1)}) ≤ cond(b_{2j}^{(i)})(1 + β_i)/(1 - β_i).
Since our interest is in bounding cond(b_j^{(i+1)}) in terms of cond(b_j), we
must simplify the above expression. Let K_i = Π_{j=1}^i (1 + β_j)/(1 - β_j), so
that

cond(b_j^{(i+1)}) ≤ [max_k cond(b_k)] K_i

by induction. Using β_{j+1} ≤ β_j^2 ≤ β_1^{2^j} we obtain

K_i ≤ Π_{j=1}^i (1 + β_1^{2^{j-1}})/(1 - β_1^{2^{j-1}})
< (1 + β_1)/(1 - β_1)^i.
This is a rather unattractive upper bound since it may be quite large if β_1
is close to 1. Actually, the situation is not so bad, since the β_i decrease
quadratically. If λ(A) ≤ 1 then ||ω_{2j}^{(i)}|| ≤ λ^{(i)}/2, λ^{(i)} = λ(A^{(i)}), and

cond(b_j^{(i+1)}) ≤ [max_k cond(b_k^{(i)})] (2 + λ^{(i)})/(2 - λ^{(i)}) ≤ 3 max_k cond(b_k^{(i)}).
When A is positive definite it follows from various eigenvalue-interlacing
theorems that cond(b_j^{(i)}) ≤ cond(A^{(i)}) ≤ cond(A).
4.C.3. Back Substitution
Now let us consider error propagation in the back substitution. Suppose
y_k^{(i+1)} = x_k^{(i+1)} + ξ_k^{(i+1)}, 1 ≤ k ≤ N_{i+1}, defines the computed solution to
A^{(i+1)}x^{(i+1)} = w^{(i+1)}, and we compute

y_{2j-1}^{(i)} = (b_{2j-1}^{(i)})^{-1}(w_{2j-1}^{(i)} - a_{2j-1}^{(i)}y_{j-1}^{(i+1)} - c_{2j-1}^{(i)}y_j^{(i+1)}) + e_{2j-1}^{(i)},
y_{2j}^{(i)} = y_j^{(i+1)}.

Then

ξ_{2j-1}^{(i)} = y_{2j-1}^{(i)} - x_{2j-1}^{(i)}
= -(b_{2j-1}^{(i)})^{-1}a_{2j-1}^{(i)}ξ_{j-1}^{(i+1)} - (b_{2j-1}^{(i)})^{-1}c_{2j-1}^{(i)}ξ_j^{(i+1)} + e_{2j-1}^{(i)},
ξ_{2j}^{(i)} = ξ_j^{(i+1)}.

Now define the vectors

z_1 = (e_1^{(i)}, 0, e_3^{(i)}, 0, ...),
z_2 = (0, ξ_1^{(i+1)}, 0, ξ_2^{(i+1)}, ...).

We have ||z_2|| = ||ξ^{(i+1)}|| by construction, and

ξ^{(i)} = B^{(i)}z_2 + z_1 + z_2,  B^{(i)} = B[A^{(i)}].

But in consequence of the construction we actually have

||ξ^{(i)}|| ≤ max(||B^{(i)}z_2 + z_1||, ||z_2||).

If ||B[A]|| < 1 then ||B^{(i)}z_2|| ≤ ||z_2||. Suppose ||ξ^{(m)}|| ≤ ε and
||e^{(i)}|| ≤ ε for all i = 1, ..., m-1. Then

||ξ^{(i)}|| ≤ max(||B^{(i)}|| ||z_2|| + ||z_1||, ||z_2||)
≤ ||z_2|| + ||z_1||
≤ ||ξ^{(i+1)}|| + ε
≤ (m - (i+1) + 1)ε + ε   (by induction on i)
= (m - i + 1)ε,
and ||ξ^{(1)}||, the error in the computed solution of Ax = v, is bounded by
mε, m = ⌈log_2(N+1)⌉. This is certainly a satisfactory growth rate for
error propagation in the back substitution. In fact, it is a somewhat
pessimistic upper bound, and we can more practically expect that errors in
the solution of A^{(i+1)}x^{(i+1)} = w^{(i+1)} will be damped out quickly owing to
the quadratic decreases in ||B[A^{(i)}]||.
4.C.4. Storage Requirements
On the matter of fill-in and storage requirements, we note that block
elimination in the natural ordering of A, as discussed in Section 3. A, re-
quires no additional block storage since there is no block fill-in. However,
it has already been seen that odd-even reduction does generate fill-in. The
amount of fill-in is the number of off-diagonal blocks of A^{(2)}, ..., A^{(m)},
which is Σ_{i=2}^m 2(N_i - 1) < 2N. Since the matrix A requires 3N - 2 storage
blocks, this is not an excessive increase. Of course, we have not considered
the internal structure of the blocks, as this is highly dependent on the
specific system, nor have we considered temporary storage for intermediate
computations.
We do note, however, that if the factorization need not be saved, then
some economies may be achieved. For example, in the transition from A^{(1)}
to A^{(2)}, the first equation of A^{(2)}x^{(2)} = w^{(2)} is generated from the first
three equations of A^{(1)}x^{(1)} = w^{(1)}. The first and third of these must be
saved for the back substitution, but there is no reason to save the second
once A^{(2)}x^{(2)} = w^{(2)} is formed. Thus the first equation of A^{(2)}x^{(2)} = w^{(2)}
may overwrite the second equation of A^{(1)}x^{(1)} = w^{(1)}. Similar considera-
tions hold throughout the reduction, but some modifications are needed on
the CDC STAR-100 due to its restrictions on vector addressing (see Section 10.B).
The special scalings of odd-even reduction for systems derived from
separable elliptic PDE's allow implicit representation of the blocks of
A^{(i)} as polynomials or rational functions of a single matrix variable.
Since the blocks of A are quite sparse this is necessary in order to limit
intermediate storage to a small multiple of the original storage. Unfor-
tunately, there does not seem to be a similar procedure that can be applied
to nonseparable PDE's. The only recourse is to store the blocks as dense
matrices, and on current computing systems this will limit them to small
sizes.
4.C.5. Special Techniques
Now suppose that A = (-p_j, t_j + p_j + q_j, -q_j), with p_j, q_j, t_j ≥ 0 and
p_1 = q_N = 0; we would like to apply Babuska's tricks for the LU factorization
(Section 3.A.4) to the factorization induced by odd-even reduction. This only par-
tially succeeds. First, we can represent

A^{(2)} = (-p_j^{(2)}, t_j^{(2)} + p_j^{(2)} + q_j^{(2)}, -q_j^{(2)}),

where

p_j^{(2)} = p_{2j} b_{2j-1}^{-1} p_{2j-1},
t_j^{(2)} = t_{2j} + p_{2j} b_{2j-1}^{-1} t_{2j-1} + q_{2j} b_{2j+1}^{-1} t_{2j+1},
q_j^{(2)} = q_{2j} b_{2j+1}^{-1} q_{2j+1},
b_k = t_k + p_k + q_k.
Note that t_j^{(2)} ≥ t_{2j}. On the other hand, the technique of avoiding
the back substitution by combining the LU and UL factorizations of PAP^T
will not work. Let A' = PAP^T,

(I + L)(D + U) = (I + U*)(D* + L*) = A',
(I + L)f = Pv,  (D + U)(Px) = f,
(I + U*)f* = Pv,  (D* + L*)(Px) = f*,

so that

(D + D* - D[A'])(Px) = f + f* - (L* + D[A'] + U)(Px).

The only way to have A' = L + D[A'] + U is if no fill-in occurs in
either factorization, in which case L = L[A'], U = U[A']. In particular
fill-in occurs in odd-even reduction, so the method does not work in its
entirety. However, it will still be advantageous to do the reductions by
developing the t_k^{(i)} sequences. (See [S1] for some examples.)
Schröder and Trottenberg [S1] also consider a method of "total reduc-
tion" for matrices drawn from PDE's and finite difference approximations
on a rectangular grid. Numerical stability and further developments are
discussed in [S1a]. The total reduction is similar to odd-even reduction
(which is derived in [S1] by a "partial reduction"), but just different
enough to prevent the application of our techniques. Part of the problem
seems to be that total reduction is more closely related to the PDE than
to the matrix, though the algorithm is expressible in matrix form. Various
convergence results are undoubtedly obtainable, but we have not yet deter-
mined the proper setting for their analysis.
4.D. Parallel Computation
Finally, the utility of the odd-even methods for parallel or pipeline
computation has been demonstrated in several studies [H3]. Stone [S2],
Lambiotte and Voigt [L2], and Madsen and Rodrigue [M1] discuss odd-even
reduction★ for tridiagonal systems, and conclude that it is the best method
to use when N is large (>256). Jordan [J1] considers odd-even elimination,
which is shown in [M1] to be most useful for middle ranges of N (64 < N < 256)
on a particular pipeline computer, the CDC STAR-100. For small values of
N the sequential LU factorization is best, owing to the effect of lower order
terms engendered by the pipeline mechanism.
★
The basic method given here is a block analogue of the method examined in [L2].
The inherent parallelism of the odd-even methods is in formation of
the equations of A^{(i+1)}x^{(i+1)} = w^{(i+1)} or H_{i+1}x = v^{(i+1)} and in the back
substitution for odd-even reduction. It is readily seen from (4.2) that
H_2 and v^{(2)} may be computed from H_1 and v^{(1)} by

compute LU factors of b_k, (1 ≤ k ≤ N)
solve b_k[ã_k  c̃_k  ṽ_k] = [a_k  c_k  v_k], (1 ≤ k ≤ N)
b_k^{(2)} ← b_k - a_k c̃_{k-1} - c_k ã_{k+1}, (1 ≤ k ≤ N)
v_k^{(2)} ← v_k - a_k ṽ_{k-1} - c_k ṽ_{k+1}, (1 ≤ k ≤ N)
a_k^{(2)} ← -a_k ã_{k-1}, (3 ≤ k ≤ N)
c_k^{(2)} ← -c_k c̃_{k+1}, (1 ≤ k ≤ N - 2)
(Special end conditions are handled by storing zeros strategically or by
shortening vector lengths slightly.) A^{(2)} and w^{(2)} are computed by only gen-
erating new equations for even k; the back substitution is then rather
straightforward once x^{(2)} is computed.
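For scalar blocks the steps above map directly onto whole-array operations. The sketch below is ours (numpy arrays standing in for the machine's vector operations); it forms H_2 and v^{(2)} this way and verifies that the solution is preserved:

```python
import numpy as np

N = 9
rng = np.random.default_rng(2)
b = 4.0 + rng.random(N)
a = -rng.random(N); a[0] = 0.0      # a_k multiplies x_{k-1}; none for k = 1
c = -rng.random(N); c[-1] = 0.0     # c_k multiplies x_{k+1}; none for k = N
x = rng.random(N)
H1 = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
v = H1 @ x

at, ct, vt = a / b, c / b, v / b    # solve b_k [a~_k c~_k v~_k] = [a_k c_k v_k]
z = np.zeros(1)                     # the "strategically stored zeros"
b2 = b - a * np.concatenate((z, ct[:-1])) - c * np.concatenate((at[1:], z))
v2 = v - a * np.concatenate((z, vt[:-1])) - c * np.concatenate((vt[1:], z))
a2 = -a * np.concatenate((z, at[:-1]))
c2 = -c * np.concatenate((ct[1:], z))

H2 = np.diag(b2) + np.diag(a2[2:], -2) + np.diag(c2[:-2], 2)
assert np.allclose(H2 @ x, v2)      # H_2 x = v^(2): same solution, doubled spacing
```

Each line is one vector operation of length roughly N, which is exactly what makes the method attractive on a pipeline machine.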
The algorithm for odd-even reduction, N = 2^m - 1, all blocks n × n,
is as follows:

for i = 1, ..., m-1
    compute LU factors of b_{2k+1}^{(i)}, (0 ≤ k ≤ N_{i+1})
    form A^{(i+1)} and w^{(i+1)} from A^{(i)} and w^{(i)} by the reduction formulas
solve b_1^{(m)} x_1^{(m)} = w_1^{(m)}
for i = m-1, ..., 1
    solve b_{2k+1}^{(i)} x_{2k+1}^{(i)} = w_{2k+1}^{(i)} - a_{2k+1}^{(i)} x_k^{(i+1)} - c_{2k+1}^{(i)} x_{k+1}^{(i+1)}, (0 ≤ k ≤ N_{i+1})
The ith reduction and corresponding back substitution uses
(12⅔ n^3 + O(n^2))N_{i+1} arithmetic operations, and the total operation
count for the algorithm is (12⅔ N - 12m)n^3 + O(n^2 N + n^3). This may be com-
pared to 4⅔ N n^3 + O(n^2 N + n^3) operations for solution of Ax = v by the LU
factorization.
For a crude estimate of execution time on a pipeline computer,★ we
take τN + σ, τ << σ, as the execution time for operations on vectors of
length N, and t, τ << t << σ, as the execution time for scalar operations
[H3]. The total execution time for odd-even reduction would then be
roughly (12⅔ n^3 + O(n^2))(τN + σm) + O(n^3 t) versus (4⅔ n^3 N + O(n^2 N + n^3))t
for the LU factorization done completely in scalar mode. A qualitative
comparison shows results similar to the more precise analysis given for
tridiagonal systems: for a fixed value of n, odd-even reduction will be-
come increasingly more effective than the LU factorization as N increases.
Odd-even elimination should maintain its utility for moderate values of N;
the factor τN + σm is replaced by (τN + σ)m, but data manipulation is often
much simpler than in odd-even reduction. The actual crossover points be-
tween algorithms are implementation dependent and will be considered to
a limited extent in Section 10.C.
A critical observation on the execution time of odd-even reduction is
that roughly 50% of the total time is spent in the first reduction, from
★
A more detailed timing analysis is in Section 10.C and Appendix B.
A^{(1)} = A to A^{(2)}. Special efforts to optimize the algorithm by solving
the reduced systems differently must take this fact into account. Further-
more, in timing each reduction it is seen that roughly 60% of the high-
order term n^3 N comes from the matrix multiplications used to compute A^{(i+1)}
and 30% from solving the 2n+1 linear systems for ã_{2k+1}, c̃_{2k+1} and w̃_{2k+1}. Relatively
little effort is used to factor the b_j^{(i)}'s or to make back substitutions
having saved the factorizations.
The use of a synchronous parallel computer creates some constraints
on implementation which are not so important on a sequential computer.
The first restriction in the interests of efficiency is that the blocks of
A should all be the same size. This allows simultaneous matrix operations
(factoring, etc.) to be programmed without resorting to numerous special
cases due to varying block sizes. In addition, pivoting causes ineffi-
ciencies when the pivot selections differ between the diagonal blocks.
As with the LU factorization, the assumption ||B[A]|| < 1 only guarantees
that no block pivoting is needed. If A is strictly diagonally dominant
or positive definite, then no pivoting within the blocks will be required.
Thirdly, the choice of data structures to represent the matrices and
vectors must conform to the machine's requirements for the parallel or
vector operations. This is an important problem on the CDC Star owing
to its limitations on vector addressing and allocation [L1]. When A is
restricted to having small blocks this is not as serious as it would be
otherwise. In the case of the Illiac IV, the problem of data communica-
tion between processing elements must also be considered.
4.E. Summary
We now summarize the results of this section. The odd-even elimina-
tion and reduction algorithms may be safely applied to the solution of
certain block tridiagonal linear systems, and in some cases various important
quantities involved in the solution decrease quadratically to zero.
The quadratic convergence may be explained by expressing the methods as
Newton-like iterations to compute a block diagonal matrix. Variations of
the basic methods, which have important applications to PDE's, also dis-
play quadratic convergence. Parallel computation is a natural format for
these methods, although practicalities place some restrictions on the
type of system that can be solved efficiently.
5. UNIFICATION: ITERATION AND FILL-IN
Having presented several linear and quadratic methods, we now treat
them in a unified manner as particular instances of a general nonstationary
iteration. We are also led to an interpretation of quadratic convergence
in terms of matrix fill-in, and conclude that the odd-even methods are the
only instances of the general iteration which display quadratic convergence.
These results apply to arbitrary matrices satisfying ||B[A]|| < 1 and not
only to the block tridiagonal case.
Let us first collect the formulae for the block LU factorization,
block Gauss- Jordan, and odd-even elimination.
LU: A_{i+1} = (I - L[A_i]E_i D[A_i]^{-1})A_i,  A_1 = A.
GJ: A_{i+1} = (I - (L[A_i] + U[A_i])E_i D[A_i]^{-1})A_i,  A_1 = A.
OE: H_{i+1} = (I - (L[H_i] + U[H_i])D[H_i]^{-1})H_i,  H_1 = A.

Now consider the function F(X,Y,Z) = (2I - XY)Z, and observe that

A_{i+1} = F(D[A_i] + L[A_i]E_i, D[A_i]^{-1}, A_i),
A_{i+1} = F(D[A_i] + (L[A_i] + U[A_i])E_i, D[A_i]^{-1}, A_i),
H_{i+1} = F(H_i, D[H_i]^{-1}, H_i).
Thus it is necessary to analyze sequences based on repeated applications
of F.
For fixed nonsingular Y we can define the residual matrices
r(M) = I - MY, s(M) = I - YM, so F(X,Y,Z) = (I + r(X))Z and

r(F(X,Y,Z)) = r(Z) - r(X) + r(X)r(Z),
s(F(X,Y,Z)) = s(Z) - s(X) + s(X)s(Z).
By taking X = Z and assuming that ||r(Z_1)|| < 1 or ||s(Z_1)|| < 1, the sta-
tionary iteration Z_{i+1} = F(Z_i,Y,Z_i) converges quadratically to Y^{-1}. This
is, of course, the Newton iteration.
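A few lines make the quadratic behavior concrete (our illustration, with an arbitrary well-conditioned Y): the residual s(Z_i) = I - YZ_i squares at every step of Z_{i+1} = (2I - Z_i Y)Z_i.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
Y = np.eye(n) + 0.1 * rng.random((n, n))
Z = 0.9 * np.eye(n)                # close enough that ||s(Z_1)|| < 1
I = np.eye(n)
norms = []
for _ in range(6):
    norms.append(np.linalg.norm(I - Y @ Z, ord=np.inf))
    Z = (2.0 * I - Z @ Y) @ Z      # F(Z, Y, Z): one Newton step toward Y^{-1}
assert np.allclose(Z, np.linalg.inv(Y))
# each residual norm is at most the square of the previous one
assert all(norms[i + 1] <= norms[i] ** 2 + 1e-12 for i in range(5))
```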
The related operator G(X,Y,Z) = Z + X(I - YZ) = Z + Xs(Z) has pre-
viously been studied as a means of solving linear systems [H8]. When
Z_1 = X, Z_{i+1} = G(X,Y,Z_i), Bauer [B8] observes that r(Z_{i+1}) = r(X)r(Z_i),
s(Z_{i+1}) = s(X)s(Z_i), Z_i = (I - r(X)^i)Y^{-1} = Y^{-1}(I - s(X)^i), Z_{i+k} = G(Z_i,Y,Z_k),
and in particular Z_{2i} = G(Z_i,Y,Z_i). This last formula also leads to the
Newton iteration, for G(Z,Y,Z) = F(Z,Y,Z), s(Z_{2i}) = s(Z_i)^2.
In this section we study the nonstationary iteration Z_1 = A,
Z_{i+1} = F(X_i,Y_i,Z_i), where Y_i = D[Z_i]^{-1} and X_i consists of blocks taken
from Z_i, including the diagonal. The selection of blocks for X_i is al-
lowed to change from step to step, but we shall assume that the selection
sequence is predetermined and will not be altered adaptively. (The cost
of such an adaptive procedure, even on a highly parallel computer, would
be prohibitive.)
For the solution of a linear system Ax = v we take (Z_1,v^{(1)}) = (A,v),
(Z_{i+1}, v^{(i+1)}) = (2I - X_iY_i)(Z_i, v^{(i)}), with the sequence ending in or approach-
ing (Z*, v*), where Z* is block diagonal. Then x = Z*^{-1}v*. One necessary
condition for consistency is that 2I - X_iY_i = I + r_i(X_i) be nonsingular,
which holds if ρ(r_i(X_i)) < 1. Since r_i(X_i) and s_i(X_i) = I - Y_iX_i
are similar we could also assume ρ(s_i(X_i)) < 1. Using earlier notation,
r_i(X_i) = C[X_i], s_i(X_i) = B[X_i]; the notation has been changed to provide
greater generality. In the context of an iteration converging to a block
diagonal matrix, the condition ||B[A]|| < 1 (implying ||s_i(X_i)|| ≤ ||s_i(Z_i)|| < 1)
states that A is sufficiently close to a particular block diagonal matrix,
namely D[A].
We first generalize earlier results on linear methods (||s_i(Z_i)|| < 1 ⟹
||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||) and then determine conditions under which we
obtain a quadratic method (||s_i(Z_i)|| < 1 ⟹ ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||^2).
Theorem 5.1. In the sequence Z_{i+1} = F(X_i,Y_i,Z_i), suppose that

{(p,p) | 1 ≤ p ≤ N} ⊆ T_i ⊆ {(p,q) | 1 ≤ p,q ≤ N} = U,
X_i = Σ_{T_i} E_p Z_i E_q,  Y_i = D[Z_i]^{-1},  s_i(M) = I - Y_i M.

Then ||s_i(Z_i)|| < 1 implies

||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)|| < 1.

Proof. Suppose ||s_i(Z_i)|| < 1. Define Q by Z_{i+1} = Z_i - D[Z_i]Q, so
Q = -Y_iZ_i + Y_iX_iY_iZ_i = -s_i(X_i) + s_i(X_i)s_i(Z_i). Since D[s_i(X_i)] = 0,
S = D[Q] = D[s_i(X_i)s_i(Z_i)]. Since s_i(X_i) = Σ_{T_i} E_p s_i(Z_i) E_q, ||s_i(X_i)|| ≤ ||s_i(Z_i)||
and ||S|| ≤ ||s_i(Z_i)||^2 < 1. Thus D[Z_{i+1}] = D[Z_i](I - S), Y_{i+1} = (I - S)^{-1}Y_i,
and s_i(Z_{i+1}) = (I - S)s_{i+1}(Z_{i+1}) + S. Lemma 2.2 applies, so that ||s_i(Z_{i+1})|| < 1
implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})||. But

s_i(Z_{i+1}) = s_i(Z_i) + Q = Σ_{T_i'} E_p s_i(Z_i) E_q + s_i(X_i)s_i(Z_i),  T_i' = U - T_i,

so

ρ_j(s_i(Z_{i+1})) ≤ ρ_j(Σ_{T_i'} E_p s_i(Z_i) E_q) + ρ_j(s_i(X_i)) ||s_i(Z_i)|| ≤ ρ_j(s_i(Z_i)).

Thus ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)|| < 1.
QED.
To see when the method is actually quadratic, we first establish that
at least some blocks of s_{i+1}(Z_{i+1}) are bounded in norm by ||s_i(Z_i)||^2.

Corollary 5.1. With the hypotheses of Theorem 5.1,

||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

Proof. From

(5.1)  (I - S)s_{i+1}(Z_{i+1}) + S = s_i(Z_i) - s_i(X_i) + s_i(X_i)s_i(Z_i),
S = D[s_i(X_i)s_i(Z_i)],

and E_p(I - S) = (I - S)E_p, we obtain

(I - S) Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q + S
= Σ_{T_i} E_p s_i(Z_i) E_q - Σ_{T_i} E_p s_i(X_i) E_q + Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q.

But Σ_{T_i} E_p s_i(Z_i) E_q = s_i(X_i) = Σ_{T_i} E_p s_i(X_i) E_q,
so (I - S) Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q - S. By Lemma 2.1,

||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q|| ≤ ||s_i(X_i)s_i(Z_i)|| ≤ ||s_i(Z_i)||^2.
In certain cases Corollary 5.1 provides no useful information, since
we could have Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = 0. Block elimination is one example
where this occurs. However, it is seen that if we are to have ||s_{i+1}(Z_{i+1})||
≤ ||s_i(Z_i)||^2 then we must have

(5.2)  ||Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

This is certainly true if T_i' is the null set, in which case we are generating
the H sequence, so we shall assume that T_i' is non-empty and show that
(5.2) will not hold in general.
We set one final condition on T_i for notational purposes, though it
turns out to have practical consequences: if it is true that E_p Z_1 E_q = 0
for any starting matrix Z_1, then (p,q) must be in T_i. For example, we
now write the Gauss-Jordan algorithm as

A_{i+1} = F(D[A_i] + (L[A_i] + U[A_i])(Σ_{j=1}^i E_j), D[A_i]^{-1}, A_i).

Inclusion in T_i of all blocks in Z_i known to be zero clearly will not
affect the sequence Z_{i+1} = F(X_i,Y_i,Z_i). With the additional condition
on T_i, it follows that if (p,q) ∉ T_i then E_p Z_i E_q can be zero only by
accident, and not by design.
Corollary 5.2. With the hypotheses of Theorem 5.1, only the choice T_i' = ∅
yields a quadratic method for arbitrary starting matrices Z_1.

Proof. Suppose that T_i' ≠ ∅. Following the proof of Corollary 5.1, we
obtain

(I - S) Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q = Σ_{T_i'} E_p s_i(Z_i) E_q + Σ_{T_i'} E_p s_i(X_i)s_i(Z_i) E_q,

which implies, by Lemma 2.1,

||Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||Σ_{T_i'} E_p s_i(Z_i) E_q + S + Σ_{T_i'} E_p s_i(X_i)s_i(Z_i) E_q||.

The presence of nonzero 1st order terms (Σ_{T_i'} E_p s_i(Z_i) E_q) in general dominates
the 2nd order terms (S and Σ_{T_i'} E_p s_i(X_i)s_i(Z_i) E_q), and we cannot conclude that
(5.2) will hold for arbitrary Z_1 satisfying ||s_1(Z_1)|| < 1.
QED
Now let us consider what happens when block fill-in occurs, as it
will during the odd-even algorithms for block tridiagonals. For the off-
diagonal (p,q) block, p ≠ q, we have E_p Z_i E_q = 0 but E_p Z_{i+1} E_q ≠ 0. Since
Z_{i+1} = 2Z_i - X_iY_iZ_i, E_p Z_{i+1} E_q = E_p X_i(-Y_iZ_i)E_q = E_p X_i s_i(Z_i) E_q, and
||E_p Z_{i+1} E_q|| ≤ ||E_p X_i|| ||s_i(Z_i)E_q||. From (5.1) and E_p S E_q = 0, we have
(I - S) E_p s_{i+1}(Z_{i+1}) E_q = E_p s_i(X_i)s_i(Z_i) E_q, which implies

||E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||S + E_p s_i(X_i)s_i(Z_i) E_q|| ≤ ||s_i(Z_i)||^2.

(We cannot conclude that ||E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_{i+1}(Z_{i+1})||^2, however.)

As shown by these inequalities, fill-in produces small off-diagonal
blocks of Z_{i+1} when ||s_i(Z_i)|| < 1, and the corresponding blocks of s_{i+1}(Z_{i+1})
are quadratically small with respect to s_i(Z_i). Thus one source of quadra-
tic convergence is from block fill-in during the process Z_{i+1} = F(X_i,Y_i,Z_i).
In summary, we now have two explanations for quadratic convergence in
the odd-even methods when applied to block tridiagonal matrices. The first
is that the method is a variation of the quadratic Newton iteration for
A^{-1}. The second is that, in going from H_i to H_{i+1}, all the non-zero off-
diagonal blocks of H_i are eliminated and all the non-zero off-diagonal
blocks of H_{i+1} arise from block fill-in. Each block of B[H_{i+1}] is there-
fore either zero or quadratic in B[H_i].
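The squaring can be watched directly. The sketch below is ours (1 × 1 blocks): it iterates H_{i+1} = (2D[H_i] - H_i)D[H_i]^{-1}H_i and checks that ||B[H_i]|| squares at each step:

```python
import numpy as np

def B(H):
    """B[H] = -D[H]^{-1}(L[H] + U[H]), here for 1 x 1 blocks."""
    D = np.diag(np.diag(H))
    return np.linalg.solve(D, D - H)

N = 31
A = np.diag(3.0 * np.ones(N)) + np.diag(-np.ones(N - 1), 1) + np.diag(-np.ones(N - 1), -1)
H = A.copy()
prev = np.linalg.norm(B(H), ord=np.inf)   # row-sum norm = block row norm here
assert prev < 1.0
for _ in range(4):
    D = np.diag(np.diag(H))
    H = (2.0 * D - H) @ np.linalg.solve(D, H)   # odd-even elimination step
    cur = np.linalg.norm(B(H), ord=np.inf)
    assert cur <= prev ** 2 + 1e-12             # ||B[H_{i+1}]|| <= ||B[H_i]||^2
    prev = cur
```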
6. HIGHER-ORDER METHODS
For Poisson's equation, where A = (-I, b, -I), the choice N = 2^m - 1
leads to considerable simplification in the Buneman odd-even reduction
algorithm. The most important properties are A^{(i)} = (-I, b^{(i)}, -I)
(constant block diagonals) and the expression of b^{(i)} as a polynomial in
b. However, in some applications other choices of N are much more natural.
Sweet [S10] therefore described a generalization of odd-even reduction
which could handle these cases and still preserve constant block diagonals
and a polynomial representation. Bank's generalized marching algorithm
[B5] for Poisson's equation uses one step of (very nearly the same) gen-
eralized reduction in order to control the stability of a fast marching
algorithm. This technique is analogous to multiple shooting methods for
ordinary differential equations.
In this section we analyze Sweet's method as yet another variation
of odd-even reduction, and show that it possesses higher-order convergence
properties. The algorithm actually considered here yields Sweet's method
through a suitable block diagonal scaling, and hence our results carry
through much as before. It must be emphasized, though, that the methods
of this section are not intended as practical methods for general block
tridiagonal systems. Rather, our purpose is to elaborate on the quadra-
tic properties of odd-even reduction by showing that they are special
cases of convergence properties for a family of methods.
6.A. p-Fold Reduction
The idea behind the generalization of odd-even reduction is to take
a block tridiagonal linear system involving the unknowns x_1, x_2, ..., x_N, and
to produce a new, smaller block tridiagonal system involving only the
unknowns x_p, x_{2p}, .... Once this system is solved, the now-known x_p, x_{2p}, ...
are back-substituted into the original equations to compute the remaining
unknowns. The entire process will be called p-fold reduction; odd-even
reduction is the case p = 2.
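Dense linear algebra gives a compact, if inefficient, way to see what p-fold reduction produces: the reduced system is the Schur complement of the dropped unknowns, and it is again tridiagonal. The sketch below is our illustration (p = 3, scalar blocks):

```python
import numpy as np

p, N = 3, 14
rng = np.random.default_rng(4)
A = np.diag(4.0 + rng.random(N)) + np.diag(-rng.random(N - 1), -1) + np.diag(-rng.random(N - 1), 1)
x_true = rng.random(N)
v = A @ x_true

keep = np.arange(p - 1, N, p)               # 0-based indices of x_p, x_2p, ...
drop = np.setdiff1d(np.arange(N), keep)
Add = A[np.ix_(drop, drop)]                 # block diagonal over dropped segments
S = A[np.ix_(keep, keep)] - A[np.ix_(keep, drop)] @ np.linalg.solve(Add, A[np.ix_(drop, keep)])
w = v[keep] - A[np.ix_(keep, drop)] @ np.linalg.solve(Add, v[drop])
x_keep = np.linalg.solve(S, w)              # the reduced (N_2-sized) solve
assert np.allclose(x_keep, x_true[keep])
assert np.allclose(S, np.triu(np.tril(S, 1), -1))   # reduced system is tridiagonal
```

The algorithm of this section computes exactly this Schur complement, but by short LU and UL sweeps on the small blocks between kept unknowns rather than by dense solves.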
Let p ≥ 2 be a given integer, and define N_2 = ⌊N/p⌋. Starting with
Ax = v we generate the reduced system A^{(2)}x^{(2)} = w^{(2)}, where
A^{(2)} = (a_j^{(2)}, b_j^{(2)}, c_j^{(2)})_{N_2}, x^{(2)} = (x_{pj})_{N_2}, and the diagonal block b_j^{(2)}
has dimension n_{pj}. For 1 ≤ k ≤ N_2 + 1, define

Λ_k = [ b_{kp-p+1}  c_{kp-p+1}
        a_{kp-p+2}  b_{kp-p+2}  c_{kp-p+2}
                    ...
                    a_{kp-1}    b_{kp-1} ].

When kp-1 > N, either extend Ax = v with trivial equations or let Λ_k be
redefined as a smaller matrix using the available blocks of A. Next, re-
partition A to have diagonal blocks

Λ_1, b_p, Λ_2, b_{2p}, ..., Λ_{N_2}, b_{N_2 p}, Λ_{N_2+1},

and call the new system Āx̄ = v̄. Three adjacent block rows of Ā consist of
the equations aggregated in Λ_k, the single block equation centered on b_{kp},
and the equations aggregated in Λ_{k+1}; the Λ_k rows are coupled to x_{kp-p}
only through a_{kp-p+1} and to x_{kp} only through c_{kp-1}.
The p-fold reduction is obtained by performing a 2-fold reduction on Ā,
and the back substitution is obtained by solving the systems

Λ_k [α_k  γ_k  φ_k] = [ (a_{kp-p+1}, 0, ..., 0)^T  (0, ..., 0, c_{kp-1})^T  (v_{kp-p+1}, ..., v_{kp-1})^T ].

By taking components of these solutions, the reduction step simplifies to

(6.1)
a_k^{(2)} = -a_{kp}(α_k)_{p-1},
b_k^{(2)} = b_{kp} - a_{kp}(γ_k)_{p-1} - c_{kp}(α_{k+1})_1,
c_k^{(2)} = -c_{kp}(γ_{k+1})_1,
w_k^{(2)} = v_{kp} - a_{kp}(φ_k)_{p-1} - c_{kp}(φ_{k+1})_1,

and the back substitution becomes

x_{kp-j} = (φ_k)_{p-j} - (α_k)_{p-j} x_{kp-p} - (γ_k)_{p-j} x_{kp},  1 ≤ j ≤ p-1.
Rather than computing the entire α and γ vectors, a more efficient
technique is to use LU and UL processes on the Ã_k blocks:

for each k = 1, 2, ..., N_2 + 1 (LU process on Ã_k):
    A_{k,1} = a_{kp-p+1},  B_{k,1} = b_{kp-p+1},  F_{k,1} = v_{kp-p+1};
    for j = 2, ..., p-1:
        solve B_{k,j-1}(Â_{k,j-1}, Ĉ_{k,j-1}, F̂_{k,j-1}) = (A_{k,j-1}, c_{kp-p+j-1}, F_{k,j-1}),
        A_{k,j} = -a_{kp-p+j} Â_{k,j-1},
        B_{k,j} = b_{kp-p+j} - a_{kp-p+j} Ĉ_{k,j-1},
        F_{k,j} = v_{kp-p+j} - a_{kp-p+j} F̂_{k,j-1};
    solve B_{k,p-1}(α_{kp-1}, γ_{kp-1}, φ_{kp-1}) = (A_{k,p-1}, c_{kp-1}, F_{k,p-1}).

for each k = 1, 2, ..., N_2 (UL process on Ã_{k+1}):
    B_{k+1,-1} = b_{kp+p-1},  C_{k+1,-1} = c_{kp+p-1},  F_{k+1,-1} = v_{kp+p-1};
    for j = 2, ..., p-1:
        solve B_{k+1,-j+1}(Â_{k+1,-j+1}, Ĉ_{k+1,-j+1}, F̂_{k+1,-j+1}) = (a_{kp+p-j+1}, C_{k+1,-j+1}, F_{k+1,-j+1}),
        B_{k+1,-j} = b_{kp+p-j} - c_{kp+p-j} Â_{k+1,-j+1},
        C_{k+1,-j} = -c_{kp+p-j} Ĉ_{k+1,-j+1},
        F_{k+1,-j} = v_{kp+p-j} - c_{kp+p-j} F̂_{k+1,-j+1};
    solve B_{k+1,-p+1}(α_{kp+1}, γ_{kp+1}, φ_{kp+1}) = (a_{kp+1}, C_{k+1,-p+1}, F_{k+1,-p+1}).

The block tridiagonal system A^{(2)} x^{(2)} = w^{(2)} is now generated by (6.1) and
solved by some procedure, and we take advantage of the earlier LU process
for Ã_k in the back substitution:

for each k = 1, 2, ..., N_2 + 1:
    x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - γ_{kp-1} x_{kp};
    for j = p-2, ..., 1:
        x_{kp-p+j} = F̂_{k,j} - Â_{k,j} x_{kp-p} - Ĉ_{k,j} x_{kp-p+j+1}.
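For scalar blocks (n_j = 1) the whole procedure — the three solves against each Ã_k, the reduced system (6.1), and the back substitution — can be sketched directly. This is our minimal illustration, not code from the thesis; the `thomas` helper is an ordinary tridiagonal solver, and the layout assumes N = p(N_2 + 1) - 1 so that every group has exactly p-1 unknowns.

```python
def thomas(sub, diag, sup, rhs):
    """Solve sub[j]*x[j-1] + diag[j]*x[j] + sup[j]*x[j+1] = rhs[j] (Thomas algorithm)."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def pfold_solve(a, b, c, v, p):
    """p-fold reduction for scalar blocks; assumes N = p*(N2+1) - 1 (0-based arrays,
    a[0] = c[N-1] = 0 by convention)."""
    N = len(b)
    assert (N + 1) % p == 0
    N2 = (N + 1) // p - 1
    alpha, gamma, phi = [0.0] * N, [0.0] * N, [0.0] * N
    for k in range(N2 + 1):                 # three solves against each group A~_k
        lo, hi = k * p, k * p + p - 2
        m = hi - lo + 1
        sub = [0.0] + [a[j] for j in range(lo + 1, hi + 1)]
        diag = [b[j] for j in range(lo, hi + 1)]
        sup = [c[j] for j in range(lo, hi)] + [0.0]
        ra = [a[lo] if (i == 0 and lo > 0) else 0.0 for i in range(m)]
        rg = [c[hi] if (i == m - 1 and hi < N - 1) else 0.0 for i in range(m)]
        rv = [v[j] for j in range(lo, hi + 1)]
        for col, rhs in ((alpha, ra), (gamma, rg), (phi, rv)):
            sol = thomas(sub, diag, sup, rhs)
            for i in range(m):
                col[lo + i] = sol[i]
    # reduced system (6.1) over the retained unknowns x_p, x_2p, ...
    a2, b2, c2, w2 = [0.0] * N2, [0.0] * N2, [0.0] * N2, [0.0] * N2
    for k in range(N2):
        r = k * p + p - 1
        a2[k] = -a[r] * alpha[r - 1]
        b2[k] = b[r] - a[r] * gamma[r - 1] - c[r] * alpha[r + 1]
        c2[k] = -c[r] * gamma[r + 1]
        w2[k] = v[r] - a[r] * phi[r - 1] - c[r] * phi[r + 1]
    xred = thomas(a2, b2, c2, w2) if N2 > 0 else []
    x = [0.0] * N
    for k in range(N2):
        x[k * p + p - 1] = xred[k]
    # back substitution: x_{kp-j} = phi - alpha * x_{kp-p} - gamma * x_{kp}
    for k in range(N2 + 1):
        xl = x[k * p - 1] if k > 0 else 0.0
        xr = x[k * p + p - 1] if k < N2 else 0.0
        for j in range(k * p, k * p + p - 1):
            x[j] = phi[j] - alpha[j] * xl - gamma[j] * xr
    return x
```

For the tridiagonal system (-1, 4, -1)x = v of order 8 with p = 3, or of order 7 with p = 2 (ordinary odd-even reduction), the computed solution agrees with a direct solve.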
6.B. Convergence Rates

As observed in Section 2, if ||B[A]|| < 1 then, under the Ã partitioning
for p-fold reduction, we have ||B[Ã]|| ≤ ||B[A]||. By Corollary 4.1
it follows that p-fold reduction will yield ||A^{(2)}|| ≤ ||Ã|| = ||A|| and
||B[A^{(2)}]|| ≤ ||B[Ã]²|| ≤ ||B[A]||². In fact, we can prove pth order
convergence, although our first result is somewhat weaker than might be
expected in comparison to earlier results.
We will actually consider a natural generalization of odd-even elimination.
Let M_p be a nonsingular block (2p-1)-diagonal matrix,

    M_p = (m_{j,-p+1}, ..., m_{j,0}, ..., m_{j,p-1})_N.

We want to pick M_p such that

    M_p A = (r_j, 0, ..., 0, s_j, 0, ..., 0, t_j)_N

is a block (2p+1)-diagonal matrix; A^{(2)} is constructed by selecting
the intersections of the block rows p, 2p, ..., and block columns p, 2p, ....
Thus we have

    a_j^{(2)} = r_{jp},  b_j^{(2)} = s_{jp},  c_j^{(2)} = t_{jp},  w_j^{(2)} = (M_p v)_{jp}.

Now, the jth block row of M_p A is

    Σ_{k=-p+1}^{p-1} m_{j,k} [block row (j+k) of A],

so that

    r_j = m_{j,-p+1} a_{j-p+1},
    s_j = m_{j,-1} c_{j-1} + m_{j,0} b_j + m_{j,+1} a_{j+1},
    t_j = m_{j,p-1} c_{j+p-1}.

We require that

    m_{j,k-1} c_{j+k-1} + m_{j,k} b_{j+k} + m_{j,k+1} a_{j+k+1} = 0,    |k| = 1, 2, ..., p-1,

where we take m_{j,-p} = 0, m_{j,p} = 0.
The special case p = 2, M_2 = (-a_j b_{j-1}^{-1}, I, -c_j b_{j+1}^{-1}), has been considered
in Section 4.A. For p > 2, define

    d_{j,-p+1} = b_{j-p+1},
    d_{j,-k} = b_{j-k} - a_{j-k} d_{j,-k-1}^{-1} c_{j-k-1},    k = p-2, ..., 1,
    μ_{j,-k} = -a_{j-k} d_{j,-k-1}^{-1},                       k = p-2, ..., 0,

    d_{j,p-1} = b_{j+p-1},
    d_{j,k} = b_{j+k} - c_{j+k} d_{j,k+1}^{-1} a_{j+k+1},      k = p-2, ..., 1,
    ν_{j,k} = -c_{j+k} d_{j,k+1}^{-1},                         k = p-2, ..., 0.

Then we may take m_{j,0} = I and

    m_{j,-k} = μ_{j,0} μ_{j,-1} ⋯ μ_{j,-k+1},
    m_{j,k}  = ν_{j,0} ν_{j,1} ⋯ ν_{j,k-1},    k = 1, 2, ..., p-1.

The d_{j,-k}'s are derived by an LU factorization process, and the d_{j,+k}'s
by a UL process. It follows from earlier results that if ||B[A]|| < 1
then the blocks m_{j,±k} decrease in norm as |k| increases, so the diagonals
of M_p decrease in size as one moves away from the main diagonal.
Our first theorem partially generalizes Theorem 4.1.

Theorem 6.1. Let J = (-s_j^{-1} r_j, 0, ..., 0, -s_j^{-1} t_j), B = (-b_j^{-1} a_j, 0, -b_j^{-1} c_j),
B = L + U, J = J_L + J_U, ||L|| ≤ α, ||U|| ≤ α, ||B|| ≤ 2α = β, with β ≤ 1.
Then ||J|| ≤ β^p and ||J_L||, ||J_U|| ≤ ½β^p.

Proof. Let R_0 = 0, R_k = L(I - R_{k-1})^{-1}U, T_0 = 0, T_k = U(I - T_{k-1})^{-1}L,
k = 1, ..., p-1, S = R_{p-1} + T_{p-1}, so that

    (I - S)J_L = L(I - R_{p-2})^{-1} L (I - R_{p-3})^{-1} L ⋯ L(I - R_0)^{-1} L,
    (I - S)J_U = U(I - T_{p-2})^{-1} U (I - T_{p-3})^{-1} U ⋯ U(I - T_0)^{-1} U.

Define ψ_0 = 0, ψ_k = α²(1 - ψ_{k-1})^{-1}, so that ||R_k|| ≤ ψ_k, ||T_k|| ≤ ψ_k and

    ||J_L||, ||J_U|| ≤ α^p / [(1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k)]
                    = β^p / [2^p (1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k)].

Define δ_0 = 0, δ_1 = 1, δ_{k+1} = δ_k - α²δ_{k-1}, so δ_k = det(α, 1, α)_{(k-1)×(k-1)} ≥ 0.
Then ψ_k = α²δ_k/δ_{k+1}, 1 - ψ_k = δ_{k+2}/δ_{k+1}, and

    (1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k) = (1 - 2ψ_{p-1}) δ_p = δ_{p+1} - α²δ_{p-1}.

Define T_k = δ_{k+1} - α²δ_{k-1}, so T_1 = 1, T_2 = 1 - 2α², T_k = T_{k-1} - α²T_{k-2}.
By induction, T_k(½) = 2^{-k+1} and dT_k/dα = -2kαδ_{k-1} ≤ 0. But then, for
α ≤ ½, T_k(α) ≥ T_k(½), so 2^{p-1} T_p(α) ≥ 1.
Thus ||J|| ≤ β^p and ||J_L||, ||J_U|| ≤ ½β^p.
QED.
Corollary 6.1. If ||B[A]|| < 1 and ||L[B[A]]||, ||U[B[A]]|| ≤ ½||B[A]||, then
||B[A^{(2)}]|| ≤ ||B[A]||^p.
In generalization of Theorem 4.4, there is the very surprising

Theorem 6.2. With λ^{(i)} = λ(A^{(i)}), λ^{(1)} ≤ 1, μ^{(i)} = (1 - √(1 - λ^{(i)}))/(1 + √(1 - λ^{(i)})),
we have

    λ^{(i+1)} ≤ λ^{(i)p} [ ((1 + μ^{(i)})/2)^p · 2/(1 + μ^{(i)p}) ]².
Proof. It suffices to consider i = 1, so we will drop the superscripts.
Define e_{j,-k} = b_{j-k}^{-1} d_{j,-k}, η_{j,k} = b_{j+k}^{-1} d_{j,k}, k = 0, ..., p-1, so that

    r_j = (-a_j e_{j,-1}^{-1})(-b_{j-1}^{-1} a_{j-1} e_{j,-2}^{-1}) ⋯ (-b_{j-p+2}^{-1} a_{j-p+2} e_{j,-p+1}^{-1})
          × (b_{j-p+1}^{-1} a_{j-p+1}),        a_j^{(2)} = r_{jp},

    t_j = (-c_j η_{j,1}^{-1})(-b_{j+1}^{-1} c_{j+1} η_{j,2}^{-1}) ⋯ (-b_{j+p-2}^{-1} c_{j+p-2} η_{j,p-1}^{-1})
          × (b_{j+p-1}^{-1} c_{j+p-1}),        c_j^{(2)} = t_{jp},

    b_j^{(2)} = s_{jp},
    s_j = b_j - a_j e_{j,-1}^{-1} b_{j-1}^{-1} c_{j-1} - c_j η_{j,1}^{-1} b_{j+1}^{-1} a_{j+1} = b_j (I - σ_j).

Now η_{j,p-1} = I and ||η_{j,k}^{-1}|| ≤ (1 - λ||η_{j,k+1}^{-1}||/4)^{-1}. By defining E_1 = 1,
E_k = (1 - λE_{k-1}/4)^{-1}, we have ||η_{j,p-k}^{-1}|| ≤ E_k; similarly, ||e_{j,-p+k}^{-1}|| ≤ E_k.
This yields ||σ_j|| ≤ λE_{p-1}/2 and ||(I - σ_j)^{-1}|| ≤ 2/(2 - λE_{p-1}). To bound λ^{(2)},
observe that

    4 ||b_j^{(2)-1} a_j^{(2)}|| ||b_{j-1}^{(2)-1} c_{j-1}^{(2)}||
      ≤ 4 ||(I - σ_{jp})^{-1}|| ||(I - σ_{jp-p})^{-1}||
        × ||b_{jp}^{-1} a_{jp}|| ||e_{jp,-1}^{-1}|| ||b_{jp-1}^{-1} a_{jp-1}|| ||e_{jp,-2}^{-1}|| ⋯
          ||e_{jp,-p+1}^{-1}|| ||b_{jp-p+1}^{-1} a_{jp-p+1}||
        × ||b_{jp-p}^{-1} c_{jp-p}|| ||η_{jp-p,1}^{-1}|| ⋯ ||η_{jp-p,p-1}^{-1}|| ||b_{jp-1}^{-1} c_{jp-1}||
      ≤ 4 (2/(2 - λE_{p-1}))² (λ/4)^p Π_{j=1}^{p-1} E_j².

Thus λ^{(2)} is bounded by the above quantity. Suppose first that λ = 1,
which implies that μ = 1. Then E_k = 2k/(k+1), the bound simplifies to
λ^p, and we are done with this case. Now suppose that λ < 1. It may be
shown by induction that E_k = (1 + μ)(1 - μ^k)/(1 - μ^{k+1}), yielding
2/(2 - λE_{p-1}) = (1 - λ)^{-1/2} (1 - μ^p)/(1 + μ^p), and with a little effort
the bound simplifies to the desired result. QED
Note that we have not actually shown that λ^{(i)} ≤ 1 implies λ^{(i+1)} ≤ 1.
This point is now to be resolved.
Corollary 6.2. If λ^{(i)} ≤ 1 then λ^{(i+1)} ≤ λ^{(i)p}, with strict inequality if
λ^{(i)} < 1, and μ^{(i+1)} ≤ μ^{(i)p}.

Proof. The case λ^{(1)} = 1 was considered in Theorem 6.2. Suppose that
0 ≤ λ < 1, again considering only the case i = 1. It is not hard to show
that

    ((1 + μ)/2)^p (2/(1 + μ^p)) = (π_p(λ))^{-1},  where  π_p(λ) = Σ_{i=0}^{⌊p/2⌋} (p choose 2i)(1 - λ)^i.

Now, π_p(1) = 1 and π_p'(λ) < 0 for 0 ≤ λ < 1, so π_p(λ) > 1 for 0 ≤ λ < 1 and
λ^{(2)} < λ^p for this case. Assuming only λ ≤ 1, and since λ(1 + μ)²/4 = μ,
we have λ^{(2)} ≤ μ^p (2/(1 + μ^p))², which implies μ^{(2)} ≤ μ^p. QED
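The identity at the heart of this proof can be checked numerically. The sketch below (function names are ours) verifies that ((1+μ)/2)^p (2/(1+μ^p)) π_p(λ) = 1 over a range of p and λ.

```python
from math import comb, sqrt

def mu_of(lam):
    """mu = (1 - sqrt(1 - lam)) / (1 + sqrt(1 - lam)), as in Theorem 6.2."""
    nu = sqrt(1.0 - lam)
    return (1.0 - nu) / (1.0 + nu)

def pi_p(p, lam):
    """pi_p(lam) = sum over i of C(p, 2i) (1 - lam)^i, i = 0, ..., floor(p/2)."""
    return sum(comb(p, 2 * i) * (1.0 - lam) ** i for i in range(p // 2 + 1))

def lhs(p, lam):
    """((1 + mu)/2)^p * 2 / (1 + mu^p)."""
    m = mu_of(lam)
    return ((1.0 + m) / 2.0) ** p * 2.0 / (1.0 + m ** p)
```

Since π_p(λ) ≥ 1 on [0, 1], the factor multiplying λ^p in Theorem 6.2 is at most 1, which is exactly Corollary 6.2.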
6.C. Back Substitution

Error propagation in the back substitution for p-fold reduction depends
on the particular implementation. In the simplest form

    x_{kp-j} = φ_{kp-j} - α_{kp-j} x_{kp-p} - γ_{kp-j} x_{kp},

errors in x_{kp-j} depend on newly generated rounding errors and propagated
errors from x_{kp-p} and x_{kp}. We can write the computed solution as
y_{kp-j} = x_{kp-j} + ξ_{kp-j} to find

    ξ_{kp-j} = -α_{kp-j} ξ_{kp-p} - γ_{kp-j} ξ_{kp} + e_{kp-j}.

It follows from the analysis of the Gauss-Jordan algorithm (Corollary 3.5)
and from explicit formulae for α_{kp-j}, γ_{kp-j} that ||α_{kp-j}|| ≤ ||B[A]||^{p-j} and
||γ_{kp-j}|| ≤ ||B[A]||^j if ||B[A]|| < 1, which leads to

    ||ξ_{kp-j}|| ≤ ||B[A]||^{p-j} ||ξ_{kp-p}|| + ||B[A]||^j ||ξ_{kp}|| + ||e_{kp-j}||.

Errors in a newly computed component therefore depend most strongly on
errors in its closest back-substituted component, and the "influence"
decreases as the distance between components increases. Moreover, we
have

    ||ξ_{kp-j}|| ≤ ||B[A]|| max(||ξ_{kp-p}||, ||ξ_{kp}||) + ||e_{kp-j}||.

If e is a bound on errors introduced at each stage, it follows that errors
in the final computed result will be bounded by e log_p N.
Slightly different results prevail in the back substitution

    x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - γ_{kp-1} x_{kp};
    for j = 2, ..., p-1:
        x_{kp-j} = F̂_{k,p-j} - Â_{k,p-j} x_{kp-p} - Ĉ_{k,p-j} x_{kp-j+1}.

Errors in the computed solution y_{kp-j} = x_{kp-j} + ξ_{kp-j} obey the rule

    ξ_{kp-1} = -α_{kp-1} ξ_{kp-p} - γ_{kp-1} ξ_{kp} + e_{kp-1};
    for j = 2, ..., p-1:
        ξ_{kp-j} = -Â_{k,p-j} ξ_{kp-p} - Ĉ_{k,p-j} ξ_{kp-j+1} + e_{kp-j},

so the errors in x_{kp-j} depend on propagation from x_{kp-p} and x_{kp-j+1}, rather
than x_{kp-p} and x_{kp}. This implies a linear growth as in the LU factorization
in conjunction with logarithmic growth as in odd-even reduction. If we
again suppose that ||e_{kp-j}|| ≤ e, with ||ξ_{kp-p}||, ||ξ_{kp}|| ≤ ξ̄, it follows
that

    ||ξ_{kp-1}|| ≤ ||B[Ã]|| ξ̄ + e,
    ||ξ_{kp-j}|| ≤ ||B[Ã]|| max(ξ̄, ||ξ_{kp-j+1}||) + e,

which leads to ||ξ_{kp-j}|| ≤ ξ̄ + (p-1)e. The end result is that errors in
the final computed result will be bounded by ep log_p N.
7. SEMIDIRECT METHODS
Briefly, a semidirect method is a direct method that is stopped be-
fore completion. In common usage, a direct method for the solution of
Ax = v computes x exactly in a finite number of steps if exact arithmetic
is used, while an iterative method computes a sequence of approximations
x^{(i)} converging to x. The semidirect methods are intermediate on this
scale, producing an approximation to x in a finite number of steps,
while an additional finite number of steps could produce x exactly. During
the computation, intermediate results are generated which approximate the
solution. If the error is small enough then the algorithm is terminated
prematurely and an approximate solution is returned.
The method of conjugate gradients is an example of a semidirect
method since it can produce a good approximation long before its usual
termination [R2]. The accelerated Parallel Gauss iteration ([H4] and
Section 8.E) is another example, as is Malcolm and Palmer's convergent LU
factorization [M2] of A = (a, b, a), a and b positive definite with ρ(b^{-1}a) < ½.
Here we consider semidirect methods based on the quadratic conver-
gence properties of the odd-even methods. This will generalize various
results obtained by Hockney [H5], Stone [S2] and Jordan [Jl]. Buzbee [B17]
discusses a similar technique based on the Buneman algorithm for Poisson's
equation (cf. Theorem 4.3).
The name semi-iterative has already been appropriated for other purposes
[V3]; Stone [S2] initiated the name semidirect.

7.A. Incomplete Elimination
Let us first consider odd-even elimination for Ax = v, which is the
process H_1 = A, v^{(1)} = v, H_{i+1} = (I + C[H_i])H_i, v^{(i+1)} = (I + C[H_i])v^{(i)}.
In Theorem 4.1 we proved that if ||B[A]|| < 1 then ||B[H_{i+1}]|| ≤ ||B[H_i]²||,
so that the off-diagonal blocks decrease quadratically in magnitude rela-
tive to the diagonal blocks. This motivates the approximate solution
z^{(i)} = D[H_i]^{-1} v^{(i)}. Because we halt the process before completion, we
call this technique incomplete odd-even elimination.
Theorem 7.1. ||x - z^{(i)}|| / ||x|| ≤ ||B[H_i]||.

Proof. Since v^{(i)} = H_i x = D[H_i](I - B[H_i])x, we have
z^{(i)} = D[H_i]^{-1} v^{(i)} = x - B[H_i]x. QED
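For scalar blocks the elimination and the bound of Theorem 7.1 can be demonstrated directly. The sketch below is our own representation (not the thesis's): H_i is stored by its three nonzero diagonals at distance d = 2^{i-1}, with zero entries where a coupling runs off the end of the matrix.

```python
def oe_elim_step(a, b, c, v, d):
    """One odd-even elimination step H_{i+1} = (I + C[H_i]) H_i, v likewise.
    Row j of H_i reads a[j] x[j-d] + b[j] x[j] + c[j] x[j+d] = v[j]."""
    N = len(b)
    a2, b2, c2, v2 = [0.0]*N, [0.0]*N, [0.0]*N, [0.0]*N
    for j in range(N):
        al = a[j] / b[j - d] if j - d >= 0 else 0.0
        ga = c[j] / b[j + d] if j + d < N else 0.0
        a2[j] = -al * a[j - d] if j - 2*d >= 0 else 0.0
        c2[j] = -ga * c[j + d] if j + 2*d < N else 0.0
        b2[j] = b[j] - (al * c[j - d] if j - d >= 0 else 0.0) \
                     - (ga * a[j + d] if j + d < N else 0.0)
        v2[j] = v[j] - (al * v[j - d] if j - d >= 0 else 0.0) \
                     - (ga * v[j + d] if j + d < N else 0.0)
    return a2, b2, c2, v2

# incomplete elimination on (-1, 4, -1)x = v, exact solution x_j = 1
N = 31
a = [0.0] + [-1.0] * (N - 1)
b = [4.0] * N
c = [-1.0] * (N - 1) + [0.0]
v = [a[j] + b[j] + c[j] for j in range(N)]
betas, rel_errs = [], []
d = 1
while d < N:
    betas.append(max((abs(a[j]) + abs(c[j])) / abs(b[j]) for j in range(N)))
    rel_errs.append(max(abs(v[j] / b[j] - 1.0) for j in range(N)))
    a, b, c, v = oe_elim_step(a, b, c, v, d)
    d *= 2
```

At each stage the relative error of z^{(i)} = D[H_i]^{-1}v^{(i)} is within the Theorem 7.1 bound, and the recorded norms decrease quadratically.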
Thus the B matrix determines the relative error in the simple approxi-
mation z^{(i)}. If β = ||B[A]|| < 1, 0 < ε < 1, m = ⌈log₂ N⌉, then

(7.1)    k = 1 + max(0, min(m, ⌈log₂(log ε / log β)⌉))

guarantees ||B[H_k]|| ≤ ε. For example, if β = ½ and ε = 10^{-6}, then
k = 6 and we should have N > 32 for incomplete odd-even elimination to be
applied. In fact, ||B[H_6]|| ≤ β^{32} = 2^{-32} ≈ 10^{-9.6}, so the results of the
incomplete elimination would be much better than required.
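Formula (7.1) is easy to evaluate; this sketch reproduces the worked example (the function name is ours).

```python
from math import ceil, log, log2

def incomplete_elim_steps(beta, eps, N):
    """k from (7.1): steps guaranteeing ||B[H_k]|| <= eps, given beta = ||B[A]||."""
    m = ceil(log2(N))
    return 1 + max(0, min(m, ceil(log2(log(eps) / log(beta)))))

k = incomplete_elim_steps(0.5, 1e-6, 64)   # the worked example: beta = 1/2
```

With β = ½ and ε = 10⁻⁶ this gives k = 6, and indeed β^{2^{k-1}} = 2⁻³² is far below ε.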
In the case n = 1, a_j = c_j = a, b_j = b, we have
||B[H_{i+1}]|| = ||B[H_i]²|| / (2 - ||B[H_i]²||). Since x²/2 < x²/(2 - x²) < x² for 0 < x < 1,
k must be larger than

(7.2)    1 + log₂(log₂(ε/2) / log₂(β/2))

in order to have ||B[H_k]|| ≤ ε. (By induction, β_k > 2(β/2)^{2^{k-1}}, and if
2(β/2)^{2^{k-1}} > ε then β_k > ε.) A more refined lower bound is obtained
by defining τ_1 = β, τ_{i+1} = τ_i²/(2 - τ_i²), and computing the largest k such
that

(7.3)    τ_k > ε.
Since B[H_k] is the matrix associated with the block Jacobi iteration for
H_k x = v^{(k)}, when ||B[H_k]|| is small we can use this iteration to improve
the estimate z^{(k)}. In fact, the choice z^{(k)} = D[H_k]^{-1} v^{(k)} is the iterate
following the initial choice z^{(k,0)} = 0. If z^{(k,l)} is an approximation to
x and z^{(k,l+1)} = B[H_k] z^{(k,l)} + D[H_k]^{-1} v^{(k)}, then
||x - z^{(k,l)}|| ≤ ||B[H_k]^l|| ||x - z^{(k,0)}||. k and l may now be chosen to minimize
computation time subject to the constraint β^{2^{k-1} l} ≤ ε. Any of the other standard
iterative methods may be used in place of the Jacobi iteration; for related
results using the theory of non-negative matrices and regular splittings,
see [H1].
We have remarked earlier that if the tridiagonal matrix A = (a_j, b_j, c_j)
is irreducibly diagonally dominant and β_i = ||B[H_i]|| then β_i ≤ 1 and
β_{i+1} ≤ β_i². This is a case of practical importance in differential
equations. However, the example A = (-1, 2, -1), for which β_i = 1,
1 ≤ i ≤ m-1, β_m = 0, shows that strict inequality is necessary for incom-
plete elimination to be effective.
Indeed, when β_1 is very close to 1, as will be the case for most
second-order boundary value problems, the fact that we have strict in-
equality does not imply that β_i will be small for i < m. With regard
to the particular approximation z^{(i)} = D[H_i]^{-1} v^{(i)}, z_j^{(i)} depends only on
v_l, j - (2^{i-1} - 1) ≤ l ≤ j + (2^{i-1} - 1). Thus for N = 2^m - 1, i < m,
the center point z_{2^{m-1}}^{(i)} will be independent of the boundary data.* Al-
though this is not the only possible approximation, complete odd-even
elimination will be necessary to ensure convergence to the continuous
solution.
7.B. Incomplete Reduction

The incomplete odd-even reduction algorithm is analogous to incomplete
odd-even elimination. Complete reduction generates a sequence of block
tridiagonal systems A^{(i)} x^{(i)} = w^{(i)}, i = 1, ..., m, m = ⌈log₂(N+1)⌉, solves
A^{(m)} x^{(m)} = w^{(m)} exactly for x^{(m)}, and back substitutes to finally obtain
x^{(1)} = x. Incomplete reduction stops with A^{(k)} x^{(k)} = w^{(k)} for some k be-
tween 1 and m, solves this system approximately for y^{(k)} ≈ x^{(k)}, and back
substitutes y^{(k)} to obtain y = y^{(1)} ≈ x^{(1)}.
We remind the reader of observations made in Section 4.D, especially
that roughly half the actual work is performed in the first reduction, one
quarter in the second, and so forth. Thus the savings in incomplete reduc-
tion would be much less than those in incomplete elimination, where the
work per step is roughly constant. For more details, see Section 10.C.
In contrast to incomplete elimination, the incomplete reduction must
deal with error propagation in the back substitution. We have already
shown that roundoff errors in complete reduction are well-behaved, and
can only grow at most logarithmically. But since the approximation errors
quite possibly might be larger than the roundoff errors, a different analysis
is needed. We will see that a number of interesting properties result,
including no growth of errors under certain assumptions about roundoff,
and a superconvergence-like effect due to damping of errors.

* This observation is due to an anonymous referee of [H2].
An initial approximation y^{(k)} may be computed from A^{(k)} and w^{(k)} ac-
cording to our analysis of incomplete elimination. The choice
y^{(k)} = D[A^{(k)}]^{-1} w^{(k)} is a simple one, and we have
||x^{(k)} - y^{(k)}|| / ||x^{(k)}|| ≤ ||B[A^{(k)}]||.
The parameter k should be determined both for computational economy and
for a suitable approximation. Here the quadratic decrease of B[A^{(i)}] is
important and useful in some (but certainly not all) situations.
Regardless of the source of the approximation y^{(k)}, let us suppose
that y^{(k)} = x^{(k)} + δ^{(k)}. In the back substitution, y^{(i)} is computed from
y^{(i+1)} by

    y_{2j}^{(i)} = y_j^{(i+1)},
    y_{2j-1}^{(i)} = (b_{2j-1}^{(i)})^{-1} (w_{2j-1}^{(i)} - a_{2j-1}^{(i)} y_{j-1}^{(i+1)} - c_{2j-1}^{(i)} y_j^{(i+1)}) + e_{2j-1}^{(i)},

with e^{(i)} representing rounding errors. We then have

    δ_{2j}^{(i)} = δ_j^{(i+1)},
    δ_{2j-1}^{(i)} = e_{2j-1}^{(i)} - (b_{2j-1}^{(i)})^{-1} (a_{2j-1}^{(i)} δ_{j-1}^{(i+1)} + c_{2j-1}^{(i)} δ_j^{(i+1)}).

Theorem 7.2. If ||e^{(i)}|| ≤ (1 - ||B[A^{(i)}]||) ||δ^{(i+1)}|| then ||δ^{(i)}|| = ||δ^{(i+1)}||.
Proof. As in the analysis of roundoff error propagation for complete odd-even
reduction, we have

    ||δ^{(i)}|| ≤ max(||δ^{(i+1)}||, ||e^{(i)}|| + ||B[A^{(i)}]|| ||δ^{(i+1)}||).

From this expression and the hypothesis we have ||δ^{(i)}|| ≤ ||δ^{(i+1)}||.
But, also by construction, ||δ^{(i)}|| ≥ ||δ^{(i+1)}||. QED
This theorem requires some explanation. First suppose that no round-
ing errors are committed, so that e^{(i)} = 0 and all the error in y^{(1)} is
derived from the approximation error in y^{(k)}. It now follows from
||B[A^{(i)}]|| < 1 that this error will not grow during the back substitution.
From ||x^{(k)}|| ≤ ||x||, we conclude that ||x - y|| / ||x|| ≤ ||x^{(k)} - y^{(k)}|| / ||x^{(k)}||.
In fact, it is easy to see that while ||x - y|| = ||x^{(k)} - y^{(k)}||, many
vector components of x - y will be much smaller in size than ||x - y||.
In Figure 7.1 we illustrate the errors by component for the tridiagonal
system (-1, 4, -1)x = v, N = 31, k = m-1 = 4, v chosen so that x_j = 1,
and y^{(k)} = D[A^{(k)}]^{-1} w^{(k)}. In the absence of rounding errors, the back sub-
stitution quickly damps out approximation errors, creating a very pleasing
superconvergence-like effect.
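The experiment of Figure 7.1 is easy to reproduce for scalar blocks. The implementation below is our own sketch of odd-even reduction with an approximate coarse solve; after three reductions (stopping at A^{(4)}, i.e. k = m-1 = 4), the size-3 system is "solved" by y = D⁻¹w and the result back-substituted.

```python
def reduce_once(a, b, c, v):
    """One odd-even reduction, retaining the 0-based odd-indexed unknowns."""
    n = len(b)
    n2 = (n - 1) // 2
    a2, b2, c2, v2 = [0.0]*n2, [0.0]*n2, [0.0]*n2, [0.0]*n2
    for i in range(n2):
        j = 2*i + 1
        al = a[j] / b[j - 1]
        ga = c[j] / b[j + 1]
        a2[i] = -al * a[j - 1]
        b2[i] = b[j] - al * c[j - 1] - ga * a[j + 1]
        c2[i] = -ga * c[j + 1]
        v2[i] = v[j] - al * v[j - 1] - ga * v[j + 1]
    return a2, b2, c2, v2

def back_once(a, b, c, v, y2):
    """Interleave coarse values y2 and fill in the eliminated unknowns."""
    n = len(b)
    y = [0.0] * n
    for i, t in enumerate(y2):
        y[2*i + 1] = t
    for j in range(0, n, 2):
        yl = y[j - 1] if j > 0 else 0.0
        yr = y[j + 1] if j + 1 < n else 0.0
        y[j] = (v[j] - a[j] * yl - c[j] * yr) / b[j]
    return y

# N = 31, system (-1, 4, -1), v chosen so x_j = 1; stop at level k = 4
N = 31
a = [0.0] + [-1.0]*(N-1); b = [4.0]*N; c = [-1.0]*(N-1) + [0.0]
v = [a[j] + b[j] + c[j] for j in range(N)]
levels = []
for _ in range(3):                       # A(1) -> A(2) -> A(3) -> A(4)
    levels.append((a, b, c, v))
    a, b, c, v = reduce_once(a, b, c, v)
y = [v[i] / b[i] for i in range(len(b))]  # approximate solve: y(4) = D^{-1} w(4)
for a, b, c, v in reversed(levels):
    y = back_once(a, b, c, v, y)
err = [abs(t - 1.0) for t in y]
```

The largest errors sit at the coarse-grid points (1-based 8, 16, 24), while the back substitution damps the remaining components by an order of magnitude or more — the superconvergence-like effect described above.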
Now suppose that rounding errors are committed and we have the upper
bound ||e^{(i)}|| ≤ e. If ||δ^{(k)}|| ≤ νe for some moderate constant ν then we
have essentially solved the system A^{(k)} x^{(k)} = w^{(k)} exactly, and the error
propagation analysis is the same as for complete reduction. That is, we
have ||δ^{(i)}|| ≤ (k - i + ν)e, 1 ≤ i ≤ k.
The remaining case is when ||δ^{(k)}|| ≫ e. Here the semidirect method
must be considered in competition with an iterative method. The hypoth-
esis of Theorem 7.2, ||e^{(i)}|| ≤ (1 - ||B[A^{(i)}]||) ||δ^{(i+1)}||, should be viewed as
an attempt to formalize the condition ||δ^{(i)}|| ≫ e. As long as the hypoth-
esis remains true, the difference (in norm) between x^{(i)} and y^{(i)} can be en-
tirely attributed to the original approximation error. As soon as the
hypothesis is violated it is possible for the errors to grow, but we will
still be able to say that ||x - y|| = ||δ^{(k)}|| + O(ke). Although the super-
convergence effect would be moderated somewhat by roundoff, its qualitative
effects will still be felt and we can expect this bound on accumulated er-
ror to be an overestimate.
7.C. Applicability of the Methods

We close with some remarks on the practical application of the qua-
dratic semidirect methods. In Buzbee's analysis [B17] of the Buneman algo-
rithm for the differential equation (-Δ + d)u = f on a rectangle with
sides c_1 and c_2, d a nonnegative constant, it is shown that if

    π²(c_1^{-2} + c_2^{-2}) + d ≫ 1

then a truncated Buneman algorithm will be useful. Geometrically this re-
quires that the rectangle be long and thin if d is small, and hence the
block tridiagonal matrix will have small blocks when properly ordered.
By directly considering the finite difference approximation to Poisson's
equation on a rectangle with equal mesh spacing in both directions it is
possible to reach a similar conclusion; this also serves as a simple model
for more general elliptic equations. With M nodes in one direction and N
in the other, M ≤ N, the matrix A is (-I_M, P_M, -I_M)_N, P_M = (-1, 4, -1)_M.
It is readily verified that ||B[A]|| = 2||Q_M||, Q_M = P_M^{-1},
    ||Q_n|| = ½(1 - (D_m + D_{m-1})/D_n),    n = 2m,
    ||Q_n|| = ½(1 - 2D_m/D_n),               n = 2m+1,

where D_0 = 1, D_1 = 4, D_n = 4D_{n-1} - D_{n-2}.
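The closed form can be checked against a direct computation of the row sums of P_M^{-1}; the helpers below are our sketch.

```python
def tridiag_solve(sub, diag, sup, rhs):
    """Solve a tridiagonal system by Gaussian elimination (no pivoting)."""
    n = len(diag)
    b = diag[:]
    d = rhs[:]
    for i in range(1, n):
        m = sub[i] / b[i - 1]
        b[i] -= m * sup[i - 1]
        d[i] -= m * d[i - 1]
    x = [0.0] * n
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - sup[i] * x[i + 1]) / b[i]
    return x

def norm_B(M):
    """||B[A]|| = 2||Q_M|| via the determinant recurrence D_n = 4 D_{n-1} - D_{n-2}."""
    D = [1, 4]
    for _ in range(M):
        D.append(4 * D[-1] - D[-2])
    if M % 2 == 0:
        m = M // 2
        q = 0.5 * (1 - (D[m] + D[m - 1]) / D[M])
    else:
        m = (M - 1) // 2
        q = 0.5 * (1 - 2 * D[m] / D[M])
    return 2 * q
```

For M = 1, ..., 6 the formula yields 1/2, 2/3, 6/7, 10/11, 25/26, 40/41, in agreement with 2 max_j (P_M^{-1} e)_j where e is the vector of ones.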
The following table gives ||B[A]|| for M = 1(1)6, along with estimates
of the number k of odd-even reductions needed to obtain ||B[A^{(k)}]|| < 10^{-6}:

    M    ||B[A]||         lower bounds (7.2), (7.3)    upper bound (7.1)
    1    1/2   = .50           4.4       4                     6
    2    2/3   ≈ .67           4.7       4                     7
    3    6/7   ≈ .86           5.1       5                     8
    4    10/11 ≈ .91           5.2       6                     9
    5    25/26 ≈ .96           5.3       6                    10
    6    40/41 ≈ .98           5.3       7                    11
These figures again lead to the conclusion that the best use of the semi-
direct methods for elliptic equations will be with long thin regions. In
the next section we consider some iterative methods which use matrix
splittings and orderings to take advantage of this property.
8. ITERATIVE METHODS
In this section we consider the iterative solution of a finite dif-
ference approximation to a general nonseparable elliptic equation. Of
particular interest will be methods which employ, as part of each itera-
tive step, the direct or semi-direct solution of a block tridiagonal
linear system. Continuing our emphasis on methods for a parallel computer,
we want this system to be such that the method already considered can be
used efficiently.
8.A. Use of Elimination

Let us first note that several of our theorems about block elimination
have applications to iterative methods. Consider the system Ax = v and
the block Jacobi iteration x^{(i+1)} = B[A]x^{(i)} + D[A]^{-1}v. Suppose that
||B[A]|| < 1, so the iteration converges; this condition is satisfied for
many elliptic equations with suitable partitioning [V3]. If the structure
of A can be simplified by an elimination step, forming A'x = v', then
subsequent computations will be handled more efficiently, and since
||B[A']|| ≤ ||B[A]|| the iteration will not converge any slower.
Indeed, convergence of the block Jacobi iteration will often be quite
slow. When A is also positive definite the Jacobi semi-iterative method
(J-SI or Chebychev acceleration) [V3]

    x^{(i+1)} = ν_{i+1}(B[A]x^{(i)} + D[A]^{-1}v - x^{(i-1)}) + x^{(i-1)},
    ν_1 = 1,  ν_2 = 2/(2 - ρ²),  ν_{i+1} = (1 - ρ²ν_i/4)^{-1},  ρ = ρ(B[A]),

will converge at a much faster rate and is a natural candidate for parallel
computation. If we use θ = ||B[A]|| as an upper bound on ρ and use the
sequence of parameters

    ν̃_1 = 1,  ν̃_2 = 2/(2 - θ²),  ν̃_{i+1} = (1 - θ²ν̃_i/4)^{-1},

then the J-SI iteration for A' will not be slower than that for A. By
"preprocessing" Ax = v we can expect that the actual computation time will
be smaller; unfortunately a complete proof of this claim is not at hand.
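A sketch of the J-SI parameter sequence in the scalar tridiagonal case, using θ = ||B[A]|| in place of ρ; this is our own illustration, not code from the thesis.

```python
def jsi_solve(a, b, c, v, theta, iters):
    """Jacobi semi-iterative (Chebychev-accelerated) sweeps for a tridiagonal
    system, with theta >= rho(B[A]) driving the parameter recurrence."""
    n = len(b)
    def jac(x):   # one Jacobi sweep: B[A] x + D[A]^{-1} v
        return [(v[j] - (a[j] * x[j-1] if j > 0 else 0.0)
                      - (c[j] * x[j+1] if j + 1 < n else 0.0)) / b[j]
                for j in range(n)]
    x_prev = [0.0] * n
    x = jac(x_prev)                       # nu_1 = 1: plain Jacobi step
    nu = 2.0 / (2.0 - theta * theta)      # nu_2
    for _ in range(iters - 1):
        y = jac(x)
        x, x_prev = [nu * (y[j] - x_prev[j]) + x_prev[j] for j in range(n)], x
        nu = 1.0 / (1.0 - theta * theta * nu / 4.0)
    return x
```

On (-1, 4, -1)x = v of order 31 with θ = ½, fifteen accelerated sweeps are far more accurate than fifteen plain Jacobi sweeps.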
Juncosa and Mullikin [J2] discuss the effect of elimination on itera-
tive methods when A = I - M, M ≥ 0, m_rr = 0, and Σ_s m_rs ≤ 1 with strict
inequality for at least one value of r. Their particular interest for el-
liptic boundary value problems is to order the boundary nodes followed by
the interior nodes, and then to eliminate the boundary nodes. This would
be useful for domains with curved or irregular boundaries. It is shown
that, with A' = I - M' representing the system after eliminating one point,
ρ(M') ≤ ρ(M). Thus the point Jacobi iteration will not be slower. Also,
if G = (I - L[M])^{-1} U[M] is the matrix describing the Gauss-Seidel itera-
tion, then ρ(G') ≤ ρ(G). Similar results follow from work on G-matrices
[B4]. As in the proof of Corollary 3.2 we obtain corresponding results
for block elimination, though not under the more general assumption
||B[A]|| < 1.
8.B. Splittings

The block Jacobi iteration is created by "splitting off" the block
diagonal portion of A, and using the iteration D[A]x^{(i+1)} = (D[A] - A)x^{(i)} + v.
Many other iterative methods are also defined by a splitting A = M_1 - M_2,
M_1 x^{(i+1)} = M_2 x^{(i)} + v. When M_1 is a block tridiagonal matrix, various direct
and semidirect methods can be used to solve the system for x^{(i+1)}.
A particularly effective procedure discussed by Concus and Golub
[C4] is readily adapted to parallel computation. The equation

    -∇·(a(x,y)∇u) = f

on a rectangle R, a(x,y) > 0 and sufficiently smooth, undergoes a change
of variable w(x,y) = a(x,y)^{1/2} u(x,y) and scaling by a^{-1/2} to yield the equiv-
alent equation

    -Δw + p(x,y)w = q

on R, p(x,y) = a^{-1/2} Δ(a^{1/2}), q = a^{-1/2} f. The shifted iteration
-Δw_{n+1} + K w_{n+1} = (K - p)w_n + q, K a constant, is solved in the discrete form
(-Δ_h + KI)w^{(n+1)} = (KI - P)w^{(n)} + Q, where P is a diagonal matrix. Since
the system (-Δ_h + KI)b = c may be solved by the fast Poisson methods and
Chebychev acceleration is applicable, the overall method is highly paral-
lel and rapidly convergent.
8.C. Multi-line Orderings
There are a number of more classical iterative schemes which involve
solution of block tridiagonal systems. We begin with the multi-line itera-
tions for a rectangular grid; the ordering of unknowns and partitioning
for 2- line iteration on a 6x6 grid is shown in Figure 8.1. A diagram of
the resulting matrix is shown in Figure 8.2.
The underlying technique of the multi-line methods is to divide a
square or long and broad rectangle into a set of long and thin rectangles.
A discrete form of the elliptic equation is then solved over each subrec-
tangle as part of the iteration. By partitioning the grid into groups of
a few lines we obtain systems that can be solved well by general block
methods, particularly the semidirect methods of Section 7.
The multi-line block Jacobi iteration with Chebychev acceleration
is probably the simplest method to implement on a parallel or pipeline
computer and should be considered as a benchmark for other methods. With
an n×n grid and a k-line iteration, n = kn', the splitting A = M_1 - M_2 is
such that M_1 consists of n' decoupled block tridiagonal matrices, each of
which has k×k blocks with N = n. By storing zeros in strategic locations
M_1 may be regarded for computational purposes as one block tridiagonal
matrix with k×k blocks and N = nn' = n²/k. Lambiotte [L1] points out
that only log n odd-even reductions are needed because of this construc-
tion, rather than log N reductions as for the general case. When incom-
plete odd-even reduction is used, the number of lines may be chosen ac-
cording to the guidelines given in Section 7.C for Poisson's equation.
8.D. Spiral Orderings

Another useful ordering of unknowns for parallel computation is the
single spiral order diagrammed in Figures 8.3 and 8.4. The spiral order
essentially produces a very long and thin rectangular region, which is
represented by the tridiagonal matrix forming the central portion of the
matrix in Figure 8.4. In the splitting A = M_1 - M_2 we take M_1 to be the
tridiagonal center of A. For an n×n grid M_1 is n²×n² and has ||B[M_1]||
about ½; thus incomplete odd-even reduction can be used quite efficiently
to solve systems involving M_1.

We note that the single spiral order has been used with another split-
ting to yield the block peripheral method [B9], [E2]. Here the blocks are
of the periodic tridiagonal form, and the convergence rate of the periph-
eral block SOR method is shown experimentally to be similar to the 2-line
block SOR method [B10].
The double spiral ordering (Figures 8.5, 8.6) is an attempt to com-
bine features of the single spiral order with advantages of a 2-cyclic
ordering. For an n×n grid, n odd, one spiral contains ½(n² + 2n - 1)
points and the other contains ½(n² - 2n + 1) points. Multi-spiral or
-peripheral orders can also be defined.
For the single and double spiral orders, when n is even the trailing
submatrix of A takes a special form containing a few additional couplings.
These *'d elements of A should be included in the implicit part of the itera-
tion, M_1. This can be done at little extra cost, and ensures that each
row and column of M_2 has at most two non-zero elements. When n is odd a
similar modification should be made.
The mapping from the natural ordering to the peripheral or single
spiral ordering is defined by a permutation of the vector T = (1, 2, ..., n²)
into a new vector S = (S(1), S(2), ..., S(n²)). With an n×n grid, point (i,j)
is unknown number (i-1)n + j in the natural ordering. Given T, two vec-
tors forming the coordinates (i,j) of each unknown are computed by taking
quotient and remainder:

    q(k) = ⌊(k-1)/n⌋,
    i(k) = 1 + q(k),
    j(k) = k - nq(k),        (1 ≤ k ≤ n²).

For simplicity let n = 2m-1. Define

    p(i,j) = min(i, j, 2m-i, 2m-j),    (1 ≤ i, j ≤ n).

The point (i,j) is part of peripheral number p(i,j), 1 ≤ p(i,j) ≤ m.
Peripheral p has 8(m-p) points in it, except for peripheral m which con-
tains the single point (m,m).

The peripheral ordering is per. 1, per. 2, ..., per. m, where each
peripheral is taken in clockwise order: top, rhs, bottom, lhs. Define

    f(i,j) = Σ_{k=1}^{p-1} 8(m-k) = 4(2m-p)(p-1),

the number of elements contained in peripherals 1, 2, ..., p(i,j) - 1, and

    g(i,j) = f(i,j) + (i-p) + (j-p) + 1              if i ≤ j,
           = f(i,j) + 8(m-p) - (i-p) - (j-p) + 1     if i > j;

equivalently,

    g(i,j) = 4p(2m-p) + 1 - 8m + (2p + i + j)    if i ≤ j,
    g(i,j) = 4p(2m-p) + 1 - (2p + i + j)         if i > j.
The vector S(k) = g(i(k), j(k)) defines the necessary permutation, and can be
easily evaluated on a parallel or pipeline computer. The actual reordering
of a vector may be performed on the CDC STAR by using the built-in vector
permute instructions, though these should be used sparingly since they are
quite expensive. Once the vector S is generated, the elements of A and v
may be generated directly.
Of course, the implementation of an iterative method based on an
"unnatural" ordering involves much more developmental attention than is
presented here. We only mean to show the potential for parallel computation.
A more complete comparison based on actual machine capabilities has not
been performed. For such a discussion of some other methods, see Lambiotte
[L1].
8.E. Parallel Gauss

We close our section on iterative methods for block tridiagonal sys-
tems with some brief remarks on the Parallel Gauss iteration due to Traub
[T2] with extensions by Heller, Stevenson and Traub [H4]. For simplicity
we consider the normalized form A = (a_j, I, c_j).

In the block LU factorization A = (ℓ_j, I, 0)(0, d_j, c_j), let
D = diag(d_1, ..., d_N). Since D = I - L[A]D^{-1}U[A], a natural parallel itera-
tion is D^{(i)} = I - L[A](D^{(i-1)})^{-1}U[A], where D^{(0)} is some initial estimate
of D. This forms the first of three iterations making up the Parallel
Gauss method; the other two parts are Jacobi iterations for the L and U
block bidiagonal systems. It is shown in [H4] that, with λ(A) ≤ 1 and
μ(A) = (1 - √(1-λ))/(1 + √(1-λ)), we have ||D - D^{(i)}|| ≤ μ ||D - D^{(i-1)}||.
Techniques for reducing the constant μ are given in [H4]. As Stone [S2]
points out, this linear convergence is much less attractive than the qua-
dratic convergence of odd-even reduction. However, in certain situations
(parabolic equations or the multi-line iterations discussed here) it is
necessary to solve a sequence of related problems, and this would hope-
fully provide suitable initial estimates.
Let A' = (a'_j, I, c'_j) and suppose that D' has already been computed.
Since d'_j - d_j = a_j d_{j-1}^{-1} c_{j-1} - a'_j (d'_{j-1})^{-1} c'_{j-1}, by Theorem 3.4

    ||d'_j - d_j|| ≤ (λ/4)(2/(1 + √(1-λ))) + (λ'/4)(2/(1 + √(1-λ')))
                  = 1 - ½(√(1-λ) + √(1-λ')).

Thus ||D' - D|| = max_j ||d'_j - d_j|| ≤ 1 - ½(√(1-λ) + √(1-λ')), and we may take D'
as an initial approximation to D. Now, the Parallel Gauss iteration is
most useful when λ and λ' are not too close to 1, and in this case good
initial approximations will be available by the above bound on ||D' - D||.
When λ and λ' are close to 1 the bound will be large, but in this case
the iteration should probably not be used due to slow convergence.
For tridiagonal matrices we have

    d'_j - d_j = (a_j c_{j-1}/(d_{j-1} d'_{j-1}))(d'_{j-1} - d_{j-1}) + (d'_{j-1})^{-1}(a_j c_{j-1} - a'_j c'_{j-1}),

so that

    |d'_j - d_j| ≤ ((1 - √(1-λ))/(1 + √(1-λ'))) |d'_{j-1} - d_{j-1}|
                   + (2/(1 + √(1-λ'))) |a_j c_{j-1} - a'_j c'_{j-1}|.

Let Δ = max_j |a_j c_{j-1} - a'_j c'_{j-1}| ≤ (λ + λ')/4, δ_j = |d'_j - d_j|,
a = (1 - √(1-λ))/(1 + √(1-λ')), b = 2Δ/(1 + √(1-λ')). Then δ_1 = 0, δ_j ≤ a δ_{j-1} + b,
so by induction δ_j ≤ (1 + a + ⋯ + a^{j-2})b < b/(1-a) = 2Δ/(√(1-λ) + √(1-λ')).
By the intermediate value theorem (√(1-λ) + √(1-λ'))/2 = √(1-λ*) for some λ* be-
tween λ and λ', so that δ_j ≤ Δ/√(1-λ*). Again we have the situation where,
if λ* is close to 1, a large upper bound results, but here the iteration
should not be used. When λ* is not so close to 1 the bound may give more
complete information than the one derived for block tridiagonal systems.
Let us now consider the utility of the Parallel Gauss iteration for
the parabolic equation u_t = (a u_x)_x, a = a(t,x) > 0, u = u(t,x), with
0 ≤ x ≤ 1, 0 ≤ t, and u(0,x), u(t,0), u(t,1) specified. Let u_j^i ≈ u(ik, jh),
k = Δt, h = Δx, ρ = k/h². The Crank-Nicolson method for advancing to the
next time step must solve the tridiagonal system

    (I + (ρ/2)P^{i+1}) u^{i+1} = (I - (ρ/2)P^i) u^i,
    P^i = (-a_{j-1/2}^i, a_{j-1/2}^i + a_{j+1/2}^i, -a_{j+1/2}^i).

We therefore have

    A = (-ρ a_{j-1/2}^{i+1}/2, 1 + ρ(a_{j-1/2}^{i+1} + a_{j+1/2}^{i+1})/2, -ρ a_{j+1/2}^{i+1}/2);
let us now drop the superscript i+1 for convenience.
A will always be strictly diagonally dominant because a(t,x) > 0,
and we would like to know when we also have λ(A) ≤ 1. Using the mean value
theorem, define x̄_j and x̂_j by

  a_{j-1/2} + a_{j+1/2} = 2a_{j-1/2} + h b_j,    b_j = a_x((i+1)k, x̄_j),
  a_{j-3/2} + a_{j-1/2} = 2a_{j-1/2} - h b̂_j,    b̂_j = a_x((i+1)k, x̂_j).

After some algebraic manipulation it is seen that

  λ(A) = max_j (1 + (1/(p a_{j-1/2}))(1 + (ph/2) b_j))^{-1}
               × (1 + (1/(p a_{j-1/2}))(1 - (ph/2) b̂_j))^{-1}.

In order to guarantee λ(A) ≤ 1 it is sufficient to assume that h and k satisfy (p/2) h |a_x(t,x)| ≤ 1. If m_1 = max |a_x(t,x)| this becomes the condition
m_1 k ≤ 2h.
Suppose we want the more stringent condition λ(A) ≤ (1 + c)^{-2}, c ≥ 0.
This may be satisfied if we choose h and k such that, for each j,

  c p a_{j-1/2} ≤ 1 + (ph/2) b_j,
  c p a_{j-1/2} ≤ 1 - (ph/2) b̂_j.

With m_0 = max a(t,x), a sufficient condition is c p m_0 ≤ 1 - (ph/2) m_1, or

(8.1)  k ≤ 2h²/(2cm_0 + hm_1).

For c = 0, (8.1) reduces to the previous condition on h and k, while for
c = h we have k ≤ 2h/(2m_0 + m_1) = O(h).
Now, one of the best features of the Crank-Nicolson method is that
k may be chosen to be O(h). We have seen that a small value of c still
allows this choice. Conversely, if k = rh satisfies (8.1) and rm_1 < 2,
then c ≤ h(1 - rm_1/2)/(rm_0) = O(h). When we try to pick h and k so that
the Parallel Gauss method will be effective (c ≫ h) we are necessarily
restricted to small time steps. When h and k are chosen in the usual way
(k = O(h)), Parallel Gauss cannot be expected to be very effective since
λ(A) will be close to 1. Thus the method appears to be less attractive
for this particular application than originally believed, and odd-even
reduction would be preferred. Once several reductions have been performed,
λ and μ may be sufficiently small to warrant use of Parallel Gauss and its
variations.
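Bound (8.1) is easy to tabulate. A sketch, with illustrative values of h, m_0 = max a and m_1 = max |a_x| (the function name is ours):

```python
def max_timestep(h, c, m0, m1):
    """Bound (8.1): k <= 2h^2 / (2c*m0 + h*m1) guarantees
    lambda(A) <= (1 + c)^(-2) for the Crank-Nicolson tridiagonal system."""
    return 2 * h ** 2 / (2 * c * m0 + h * m1)

h, m0, m1 = 0.01, 2.0, 1.5          # illustrative mesh width and bounds on a, a_x
k0 = max_timestep(h, 0.0, m0, m1)   # c = 0: reduces to m1 * k <= 2h
kh = max_timestep(h, h, m0, m1)     # c = h: k = 2h / (2*m0 + m1) = O(h)
```

The two limiting cases reproduce the conditions stated in the text.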
9. APPLICATIONS
In this section we discuss two applications of practical importance
in which the assumption || B[A] || < 1 is not met; in the second case it is
shown how to get around this problem.
9.A. Curve Fitting

Cubic spline interpolation of the data (x_i, f_i), 1 ≤ i ≤ N + 1, gives
rise to the tridiagonal system [C6, p. 238]

  h_i s_{i-1} + 2(h_i + h_{i-1}) s_i + h_{i-1} s_{i+1}
      = 3(f[x_{i-1},x_i] h_i + f[x_i,x_{i+1}] h_{i-1}),    2 ≤ i ≤ N,    h_i = x_{i+1} - x_i.

We have A = (h_i, 2(h_i + h_{i-1}), h_{i-1}) and || B[A] || = 1/2, which is quite satisfactory. However, more general problems need not behave so nicely even
though they might give rise to block tridiagonal matrices.
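The row sums of B[A] for the spline system can be checked directly; a sketch with an arbitrary positive mesh (the mesh values and function name are illustrative):

```python
def spline_jacobi_norm(h):
    """Max row sum of B[A] for the spline system with row i:
    h_i s_{i-1} + 2(h_i + h_{i-1}) s_i + h_{i-1} s_{i+1} = rhs_i."""
    sums = []
    for i in range(1, len(h)):
        off = h[i] + h[i - 1]              # |sub| + |super| diagonal entries
        diag = 2 * (h[i] + h[i - 1])       # diagonal entry
        sums.append(off / diag)
    return max(sums)

norm = spline_jacobi_norm([0.1, 0.3, 0.25, 0.4, 0.2])   # illustrative mesh widths
```

Every row sum is (h_i + h_{i-1}) / (2(h_i + h_{i-1})) = 1/2 regardless of the mesh.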
Cox [C7] presents an algorithm for least-squares curve fitting with
piecewise polynomials which must solve the system

The v vectors contain coefficients of the Chebyshev polynomial expansion
of the piecewise polynomial, and the λ vectors contain Lagrange multipliers derived through a constrained linear least-squares problem. The C_i's
always have full rank, and if there are sufficiently many data points
between adjacent pairs of knots then the H_i's will all be positive definite. In this case a straightforward band elimination may be used, and works
well when the number of knots and data points is large. When insufficient data
is available the H_i's will not have full rank and pivoting is needed in
the band elimination. It seems that the condition || B[A] || < 1 will not
hold for any reasonable partitioning of the data. Indeed, B[A] is not even
defined for the most obvious partition.
9.B. Finite Elements

In Section 2.A we noted that matrices constructed from finite element
approximations to differential equations often do not satisfy assumptions
such as || B[A] || < 1. Thus some other technique must be used to establish
numerical stability in a direct method or convergence in an iterative
method. Usually it is shown that A is positive definite. The following
example, admittedly a simple one, shows how an easily implemented change
of variables can yield a new system A'x' = v' such that || B[A'] || < 1.
Even if the change of variables is not actually made, this will show that
|| B[A] ||_* < 1 for some norm || · ||_*, and also serves to establish numerical
stability.
Let us begin with the equation -u" = 1 on the unit interval. Using
Hermite cubics, Strang and Fix [S4, p. 59] derive the 4th order accurate
finite element equations

(9.1)  -(6/5)((u_{j+1} - 2u_j + u_{j-1})/h²) + (1/5)((u'_{j+1} - u'_{j-1})/(2h)) = 1,

       -(1/5)((u_{j+1} - u_{j-1})/(2h)) + (1/5)u'_j - (1/30)(u'_{j+1} - 2u'_j + u'_{j-1}) = 0.

Taking the natural choice x_j = (u_j, u'_j)^T, we obtain the block equation

  a x_{j-1} + b x_j + c x_{j+1} = (1, 0)^T.

The matrix A = (a, b, c) is positive definite, but || B[A] || = 1/4 + 3/(4h).
The problem is simply that the unknowns are out of scale.
A second set of unknowns, which are in scale, is
x_j = (u_j - (h/2)u'_j, u_j + (h/2)u'_j)^T ≈ (u_{j-1/2}, u_{j+1/2})^T. Since the Hermite cubics
can be derived from B-splines by coalescing adjacent nodes, these unknowns
are actually a more natural choice than the original. We now have

  x_j = [1  -1] [1   0 ] [u_j ]
        [1   1] [0  h/2] [u'_j]

and, multiplying the first equation of (9.1) by h² and the second by h,

(9.2)  a x_{j-1} + b x_j + c x_{j+1} = (h²·1, h·0)^T.
The blocks are independent of h and we have || B[A] || = 1, with row sums
strictly less than 1 for the first and last block rows. (Actually, this
assumes Dirichlet boundary conditions; Neumann conditions give a row sum
equal to 1 there as well.)
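The effect of the change of variables can be verified numerically. The sketch below forms the 2x2 blocks in the variables (u_j, u'_j) from equations (9.1) as reconstructed above (an assumption to note), applies the change of variables, and checks that every row sum of B[A] is 1; h = 0.1 is illustrative:

```python
# Blocks in the variables (u_j, u'_j): first equation of (9.1) multiplied
# by h^2, the second by h.  h is illustrative.
h = 0.1
a = [[-6/5, -h/10], [1/10, -h/30]]
b = [[12/5, 0.0], [0.0, 4*h/15]]
c = [[-6/5, h/10], [-1/10, -h/30]]

# Change of variables (u_j, u'_j)^T = Sinv * x_j with S = [[1, -h/2], [1, h/2]]:
Sinv = [[0.5, 0.5], [-1/h, 1/h]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(p):
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [[p[1][1] / det, -p[0][1] / det],
            [-p[1][0] / det, p[0][0] / det]]

at, bt, ct = matmul(a, Sinv), matmul(b, Sinv), matmul(c, Sinv)
binv = inv2(bt)
blocks = [matmul(binv, at), matmul(binv, ct)]   # blocks of B[A], up to sign
row_sums = [sum(abs(blocks[s][i][j]) for s in range(2) for j in range(2))
            for i in range(2)]
```

Both row sums come out equal to 1 independently of h, in agreement with the text.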
For the more general equation -u" + Qu = f, Q a constant, the following
equations are derived [S4, p. 59]:

  (1/30h)(-36u_{j-1} - 3h u'_{j-1} + 72u_j - 36u_{j+1} + 3h u'_{j+1})
      + (Qh/420)(54u_{j-1} + 13h u'_{j-1} + 312u_j + 54u_{j+1} - 13h u'_{j+1}) = F_j,

  (1/30)(3u_{j-1} - h u'_{j-1} + 8h u'_j - 3u_{j+1} - h u'_{j+1})
      + (Qh/420)(-13u_{j-1} - 3h u'_{j-1} + 8h u'_j + 13u_{j+1} - 3h u'_{j+1}) = F'_j.

Again taking x_j = (u_j - (h/2)u'_j, u_j + (h/2)u'_j)^T and suitably scaling the equations, they become

  a x_{j-1} + b x_j + c x_{j+1} = (hF_j, F'_j)^T,

where now the blocks are O(h²) perturbations of the blocks in (9.2). Omitting details, for h sufficiently small this gives || B[A] || = 1 - (5/24)Qh² + O(h⁴).
Since rescaling and change of variables in the linear system is equiv-
alent to a change of basis in the original function subspace, the process
should be applicable to more general differential equations solved by
finite element methods. The underlying technique is to prove something
for spline approximations, and show that the property carries over in the
collapse of nodal points yielding the Hermite cubic approximations.
Strictly in terms of modifying linear systems by a block diagonal
scaling, it is natural to ask whether either of the following problems is
solvable in general:
1. Given an arbitrary block tridiagonal matrix A, find a block
   diagonal matrix E such that || B[AE] || < 1.
2. In 1., if || B[A] || < 1, find E such that || B[AE] || < || B[A] ||.
These problems are closely related to the optimal diagonal scaling of
matrices to minimize condition numbers, and as such are outside the scope
of this thesis.
10. IMPLEMENTATION OF ALGORITHMS
This section contains some general remarks on the choice of algorithm
for particular applications, and a more detailed comparison of algorithms
for use on a vector computer. In the comparison we assume uniform nxn
block sizes, later specialized to n = 2. All algorithms will use the same
data layout, assumed to be core-contained, and the execution time compari-
son will be in two parts: first for a very simplified and idealized model
in which the machine vectors are defined as storage locations in arithmetic
progression, and second for a model in which the machine vectors are defined
as contiguous storage locations (arithmetic progression with increment = 1) .
The first model uses information for the CDC STAR-100 ignoring, among other
things, the fact that machine vectors on STAR must satisfy the conditions
for the second model. This double comparison allows us to pinpoint the
effect of data manipulation required by the stringent requirement on vec-
tor operands.
cussion, see Section 2.B and Appendix A. Implementation of algorithms for
tridiagonal linear systems, n = 1, is adequately discussed elsewhere ([H4],
[Jl], [L2], [Ml], [S2]) and these references may be considered as back-
ground material for this section.
10.A. Choice of Algorithm (Generalities)
The salient point of our analysis has been that a number of algorithms
may be applied to block tridiagonal linear systems that are likely to arise
in practice. Sometimes it may be possible to take advantage of special
features in order to reduce computing costs or estimate error. When the
computing costs are high (such as one enormous problem or a large problem
solved repeatedly) an extra effort must be made to choose the most
efficient algorithm. This choice will be difficult to make in general,
but we feel that it is not at all difficult to choose good algorithms
in many cases.
Regardless of the computer involved, algorithms for three special
problems should be implemented before proceeding to algorithms for more
general problems. The first should deal with general block tridiagonal
matrices restricted to blocks of a small uniform size, say no more than
8x8. On a standard sequential computer, band or profile methods are ap-
propriate; on a k-parallel computer some form of block LU factorization
should be used, and on a vector computer a combination of odd-even reduc-
tion and LU factorization should be used. Storage allocation and access,
vector operations on a parallel or vector computer and implementation
issues in general can be greatly simplified under the block size restric-
tion. Options include implementation as a set of subroutines for specific
block sizes, insertion of tests for use as a semidirect method, and spe-
cial scalings or pivoting strategies for efficiency or stability. This
restricted class of algorithms may be applied directly to small-block
problems or used as a module in the iterative solution of large sparse
problems as indicated in Section 8. The experience gained may or may not
be useful for implementation of methods for moderate-sized blocks, which,
in our opinion, should be treated in a different manner.
The best choice of a parallel algorithm for moderate-sized blocks is
not clear at this time. These cases represent a cross-over between effi-
cient vector implementations of band/profile methods and the odd-even
methods. Either one is likely to be satisfactory. In Lambiotte's opinion
[L1] the band methods are better for the CDC STAR, but this choice is
heavily influenced by vector addressing restrictions on the STAR (cf. our
remarks in Section 10.B). Other parallel or vector computers may not
present such problems, and a more complete analysis is needed.
The third basic class of algorithms is the fast Poisson solvers,
including generalizations. Recent studies have shown that many finite dif-
ference approximations to elliptic equations can be computed very effi-
ciently by iterative or modification methods based on the Poisson solvers.
There are several kinds of algorithms to choose from, and all have consid-
erable inherent parallelism. Other algorithms are available for elliptic
equations, but flexibility and ease of implementation suggest that Poisson-
based algorithms should be among the first to be considered. Brandt's
multi-level adaptive methods [B13] must also be seriously considered.
Finally, we note that matrix partitioning and block elimination is a
useful technique for dealing with matrices so large that secondary storage
must be used [R4]. Known as the hypermatrix scheme in other contexts, it
has been extensively exploited by the ASKA system, a structural analysis
package developed at the University of Stuttgart [F3]. Noor and Voigt [M1]
discuss implementation of hypermatrix schemes on the CDC STAR-100. The
method works as follows: suppose a partition π is given (see Section 2.A),
and let A_π be the matrix A partitioned accordingly. Block sizes in ASKA
are chosen to optimize transfers between storage levels, while the choice
on STAR must also try to maximize vector lengths and hence minimize start-
up costs. The usual situation is that a few blocks of A_π will be in main
memory, while the bulk of A_π will be in secondary memory. For example, if
π' is another partition such that π refines π', we would perform block
elimination on A_{π'} x_{π'} = v_{π'}. When A_π is block tridiagonal, so is A_{π'}.
Since the data requirements for the LU or Cholesky factorization are then
highly localized, this is a natural procedure [E3]. Moreover, the diagonal
blocks of A_{π'} will themselves be block tridiagonal, though possibly of low
order. As noted in Sections 2.A and 3.A.2, if || B[A_π] || < 1 then || B[A_{π'}] ||
≤ || B[A_π] ||, and pivoting will be restricted to those blocks of A_{π'} resident in main memory. This, of course, is precisely what is desired.
10.B. Storage Requirements

Data layout is an essential part of planning an algorithm's implementation on any parallel computer, since computational speed can be
severely decreased if data cannot be accessed conveniently. On a vector
computer this means that conceptual vectors must conform to the machine's
definition of a vector operand. We consider some consequences of this
requirement for implementation on the CDC STAR-100, assuming that the problem is core-contained.

Storage available on the STAR, for 64-bit words, consists of 256 registers and either 512K or 1M words of main memory. A sizable secondary
disc storage is also available but will not be considered here. A vector
on STAR consists of s consecutive storage locations, 1 ≤ s ≤ 64K.
Our data arrangement for block tridiagonal linear systems with uniform
nxn blocks is illustrated in fig. 10.1 for the case n = 2, N = 7. Define
m = ⌈log₂N⌉, N* = 2^m, and if N < N* then extend Ax = v with N* - N trivial
equations x_k = 0. The matrix A and vector v can then be held in three
storage vectors of length n²N* and one of length nN*. Taking the subdiagonal blocks as an example, the n² components of a_k are a_{ij,k}, 1 ≤ i ≤ n,
1 ≤ j ≤ n, 1 ≤ k ≤ N*, and the storage vector a is defined by
a((i-1)N*n + (j-1)N* + k) = a_{ij,k}. Thus a total of (3n² + n)N* locations are needed to store the original system. We do not consider the
cost of setting up the storage vectors since each of our algorithms will
[Fig. 10.1 lists the four storage vectors a, b, c, v for n = 2, N = 7,
N* = 8: each of a, b, c holds its components in the order a_{11,1..8},
a_{12,1..8}, a_{21,1..8}, a_{22,1..8}, and v holds v_{1,1..8}, v_{2,1..8}.
The boundary blocks appear as a_{ij,1} = 0 and c_{ij,7} = 0, and the
trivial padding equation x_8 = 0 as a_{ij,8} = c_{ij,8} = 0,
b_{11,8} = b_{22,8} = 1, b_{12,8} = b_{21,8} = 0, v_{i,8} = 0.]

Fig. 10.1. Data layout for n = 2, N = 7, N* = 8.
use the same arrangement and since the details can vary widely between
systems. This particular data layout has been chosen because it makes the
odd-even reduction algorithm more efficient for the true STAR model, while
the other competitive algorithms (LU factorization, odd-even elimination)
are mostly unaffected by the choice of data layout, and can do without
extra storage for trivial equations.
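The addressing rule can be sketched as follows (1-based indices as in the text; n = 2, N* = 8 matches fig. 10.1; the function name is ours):

```python
# Component a_{ij,k} lives at position (i-1)*N*n + (j-1)*N* + k of the
# storage vector a (Nstar stands for N*).
def addr(i, j, k, n, Nstar):
    return (i - 1) * Nstar * n + (j - 1) * Nstar + k

n, Nstar = 2, 8
cells = {addr(i, j, k, n, Nstar)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         for k in range(1, Nstar + 1)}
storage_total = (3 * n * n + n) * Nstar    # three block vectors plus v
```

The n²N* addresses are distinct and fill positions 1 through n²N* with no gaps, so the layout wastes no space beyond the trivial padding equations.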
Since the STAR forces vector lengths to be ≤ 64K, we must have
n²N* ≤ 64K = 65,536. For a given value of n, this restricts the system
size as follows:
  n       |     2 |    3 |    4 |    5 |    6 |    7 |    8
  64K/n²  | 16384 | 7281 | 4096 | 2621 | 1820 | 1337 | 1024
  max N*  | 16384 | 4096 | 4096 | 2048 | 1024 | 1024 | 1024
These will probably not be serious restrictions in most cases.
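The restriction is easy to reproduce; a sketch (the function name is ours):

```python
def max_Nstar(n, limit=65536):
    """Largest N* = 2^m with n^2 * N* <= limit (the 64K STAR bound).
    Returns (limit // n^2, largest power of two below that)."""
    cap = limit // (n * n)
    p = 1
    while 2 * p <= cap:
        p *= 2
    return cap, p

table = {n: max_Nstar(n) for n in range(2, 9)}
```

This regenerates both rows of the table above.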
A more important restriction on system size is our assumption that the
problem and solution will remain core-contained. Assuming a 512K main
store, and supposing that temporary storage plus the program amounts to T
times the original storage, we obtain the following restrictions on N:
  n                       |     2 |    3 |    4 |    5 |    6 |    7 |    8
  512K/((T+1)(3n²+n)):
      T = 1               | 18724 | 8738 | 5041 | 3276 | 2299 | 1702 | 1310
      T = 2               | 12483 | 5825 | 3360 | 2184 | 1533 | 1134 |  873
      T = 3               |  9362 | 4369 | 2520 | 1638 | 1149 |  851 |  655
  max N*:
      T = 1               | 16384 | 8192 | 4096 | 2048 | 2048 | 1024 | 1024
      T = 2               |  8192 | 4096 | 2048 | 2048 | 1024 | 1024 |  512
      T = 3               |  8192 | 4096 | 2048 | 1024 | 1024 |  512 |  512
The amount of temporary storage used depends on the algorithm and the
extent to which A and v are overwritten. It can range from practically no
temporary storage (other than O(n²) registers) for the LU factorization to
T ≈ m for an inefficient implementation of odd-even elimination. Odd-even
reduction would typically have T ≤ 2.
We now consider the costs of data manipulation for the basic algorithms.
Major considerations are the speed of the load/store unit on STAR, for
register-memory transfer of scalars, and the requirement that vectors be
stored as contiguous memory locations.
The LU factorization, executed mostly in scalar mode: The only data manipulation is to load/store between memory and the register file. To a considerable extent this may be overlapped with arithmetic operations. A
complete analysis requires cycle counting in an assembly language program
and timings from actual runs. Our estimate of execution time (based on an
untested cycle count) is that the algorithm will probably be I/O bound.
With this in mind, we suggest four possible implementations: (Times given
are for n = 2, and represent minimal run times assuming complete overlap
between scalar arithmetic and load/stores. Actual run time will certainly
be greater. For details, see Appendices A, B.)

  Version 1, a one-time solution, all scalar mode:
      time ≈ 508N - 362
  Version 2, factor + solve, partial vector mode:
      factor, version a, time ≈ 324N + 3421   (use for N > ≈ 38)
              version b, time ≈ 374N + 1769
              version c, time ≈ 420N - 230    (use for N < ≈ 33)
      solve,  time ≈ 303N + 1039
The three versions of factorization use increasingly fewer vector opera-
tions and increasingly more scalar operations. If more than one system
must be solved with the same matrix A, it is best to use Version 2. Our
execution time comparison in Section 10.C will use Version 1.
Odd-even elimination, executed in vector mode: An important feature of
this algorithm and any data layout like ours based on component vectors
is that the conceptual vectors all correspond to the strictest requirements for machine vectors, and hence no additional data manipulation is
needed. (cf. [M1] for the tridiagonal case.)
Odd-even reduction, executed in vector mode: If the vector computer model
allows vectors to be storage locations in general arithmetic progression,
then no additional data manipulation is needed. In a true model of the
STAR, however, we need to perform two basic operations: (1) separate a
vector into its even and odd components, and (2) rejoin the even and odd
parts of a separated vector. The first is accomplished with two STAR
compress instructions, the second with a STAR merge instruction. The odd-even separation is needed before each reduction step can be performed, and
must be applied to vectors a, b, c, and v. We illustrate the separation
for b in fig. 10.2. Having set up a control vector z consisting of n²N*/2
copies of the bits 10 (a single vector operation on STAR), we execute

  compress b → b1(1) per z           [b(2j-1) → b1(j),          1 ≤ j ≤ ½n²N*]
  compress b → b1(½n²N* + 1) per z̄   [b(2j) → b1(½n²N* + j),    1 ≤ j ≤ ½n²N*]
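In list form the two compress instructions amount to masked selection; a sketch (not STAR syntax, and the length 32 corresponds to n = 2, N* = 8):

```python
def compress(src, mask):
    """STAR-style compress: keep src[t] where mask[t] == 1, preserving order."""
    return [x for x, m in zip(src, mask) if m]

# Odd-even separation of a storage vector of length n^2 * N*:
n, Nstar = 2, 8
L = n * n * Nstar
b = list(range(1, L + 1))          # stand-in for the b_{ij,k} components
z = [1, 0] * (L // 2)              # control vector: bits '10' repeated
zbar = [1 - t for t in z]          # complementary control vector

b1 = compress(b, z) + compress(b, zbar)
# First half of b1: odd-indexed components; second half: even-indexed.
```

The single mask pattern works for all storage vectors at once, which is the point of the extended data layout.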
The cost for each compress is n²N* + (1/8)(½n²N*) + 92, and the total cost to
[Fig. 10.2 traces the storage vector b through the odd-even separation:
the control vector z = 1010..., the result b1 of the two compress
instructions (first half the components b_{ij,k} with k odd, 1 ≤ k ≤ N*;
second half the even-indexed components), and the final state after all
reductions, in which the saved groups hold b_{ij,k}^(1) for k odd,
1 ≤ k ≤ N*, b_{ij,k}^(2) for k odd, 1 ≤ k ≤ N*/2, and b_{ij,k}^(3) for
1 ≤ k ≤ N*/4.]

Fig. 10.2. Odd-even separation for n = 2, N = 7, N* = 8.
odd-even separate a, b, c, and v is (17/8)(3n² + n)N* + 736. Our initial
extension of A to N* block equations makes it possible to perform this
step using only 8 vector compress operations, independent of n². If A
had not been extended we would need either 2(3n² + n) vector operations
to compress the vectors one component subvector at a time, or n² vector
operations to generate a modified control vector consisting of n² copies
of z(1:N), followed by the original 8 vector compress operations.
Lambiotte [L1] rejected use of odd-even reduction for n > 1 on this basis.
The net effect of our extended data layout is to reduce the number
of vector operations needed to manipulate vectors for the STAR, and thus
to reduce vector startup costs. On the other hand, we have increased vector lengths and hence storage from N to N*, and this may be a more important consideration in some cases. The modified control vector approach
would then be most natural despite the increased startup costs. In any
case, O(n³) vector operations are needed for the arithmetic part of the
reduction step.
So far we have considered only the odd-even separation to prepare
for the first reduction. In general, the cost of the separation to prepare for the ith reduction, 1 ≤ i ≤ m - 1, is (17/8)(3n² + n)N*_i + 736,
N*_i = 2^{m-i+1}, and the total cost for m - 1 reductions is
(17/4)(3n² + n)(N* - 2) + 736(m - 1). For the case n = 2 this yields
59.5N* + 736m - 855. We will see shortly that this is roughly half as
much time as is spent in the arithmetic operations, so the overhead of
dealing with a restrictive machine vector definition can be considerable.
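The closed forms for the separation and merge totals can be checked against direct summation; a sketch for n = 2 (m = 10 is illustrative, and the function name is ours):

```python
# Overhead totals over m-1 reductions: separation cost
# (17/8)(3n^2+n)N*_i + 736 per step, merge cost 3*n*N*_i + 123 per step,
# with N*_i = 2^(m-i+1).
def overhead_totals(n, m):
    sep = merge = 0.0
    for i in range(1, m):
        Ni = 2 ** (m - i + 1)
        sep += (17 / 8) * (3 * n * n + n) * Ni + 736
        merge += 3 * n * Ni + 123
    return sep, merge

n, m = 2, 10
Nstar = 2 ** m
sep, merge = overhead_totals(n, m)
sep_closed = 59.5 * Nstar + 736 * m - 855      # closed forms quoted in the text
merge_closed = 12 * Nstar + 123 * m - 147
```

Both sums collapse because Σ 2^{m-i+1} over i = 1..m-1 equals 2(N* - 2).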
Data manipulation for the back substitution is simplified by saving
the odd-indexed equations of A^(i) in appropriate form for vector operations
(see fig. 10.2 for the final state of b). Having computed x^(i+1) = the
even components of x^(i), we compute the odd components of x^(i) and merge
the two sets into x^(i). The merge operation on STAR is comparatively
more expensive than the compress operation, but we are only dealing with
the solution vector and not blocks of A. The cost of the merge operation
at stage i is 3nN*_i + 123, and the total cost for m - 1 reductions is
6n(N* - 2) + 123(m - 1). For n = 2 this yields 12N* + 123m - 147.
p-fold reduction: Suitable data arrangement and use of control vectors
for p-fold reduction can readily be generalized from our discussion of
odd-even reduction. The odd-even separation generalizes to separation
into p vectors, which requires p compress instructions per storage vector;
the cost per reduction would be (p + 1/8)(3n² + n)N*_i + 368p, N*_i = p^{m-i+1},
m = ⌈log_p N⌉. The back substitution requires merging p vectors into 1,
using p - 1 merge instructions at a cost of 3(1 - 1/p)nN*_i + 123(p - 1) to go
from x^(i+1) to x^(i). It is clear that these costs are minimized by p = 2.

Jacobi iterations: Data manipulation is unnecessary when A and v, or B[A]
and D[A]^{-1}v, are stored according to our component subvector scheme.
10.C. Comparative Timing Analysis
We close with an execution time comparison of algorithms for general
block tridiagonal systems with 2x2 blocks, based on timing information for
the CDC STAR-100. This will expand the crude comparison of the LU factor-
ization and odd-even reduction given in Section 4.D, and illustrates many
earlier statements concerning the cost of various methods. Essential con-
siderations include vector startup costs, creating a substantial penalty
for operations on short vectors, and data manipulation to deal with
restrictive definitions of vector operands. Section 10.B began our dis-
cussion of the latter problem.
Our initial simplified analysis is based on the following conven-
tions, which allow us to consider only arithmetic costs:
1. we assume
a. vector operands may be any sequence of storage locations
in arithmetic progression,
b. pivoting is never necessary;
2 . we ignore
a. instruction overlap capabilities,
b. address calculations, loop control and termination tests,
c. load/store costs for scalar operands,
d. storage management costs, as a consequence of assumption 1.a.
Our second, and more realistic, analysis will demand that vector operands
consist of consecutive storage locations, so we must consider storage manage-
ment as explained in Section 10. B. Overlap, loop control and scalar load/
store will also be considered, but not address calculation or termination
of iterations. Algorithms and timing derivations are given in Appendix B;
here we present the main results and make comparisons.
Execution times for the basic direct methods are initially estimated
as follows (m = ⌈log₂N⌉):

1. the block LU factorization, using scalar operations only (LU):
       1012N - 784
2. odd-even elimination, using vector operations only (OEE):
       94.5Nm + 12999m - 93.5N - 1901
3. odd-even reduction, using vector operations except for the final
   block system (OER): 106.5N + 14839m - 14717.5
4. p-fold reduction, p ≥ 3:
       145N + 19206m((p-1)/log₂p) - (19351p - 19579)
We can first dispose of p-fold reduction as a competitive method for general problems since either LU or OER is always cheaper. Execution time
ratios show the benefits of using OER in preference to LU and OEE.

               N = 1023   N = 8191   limit, N → ∞
  OER : LU       .23        .13         .11
  OER : OEE      .24        .11         0.
The importance of vector startups is seen by comparing OER and LU,
for the latter is faster when N < 100. OER works by successively reducing
vector lengths, and at some point this becomes inefficient. A polyalgorithm
combining the two methods is straightforward: if m ≤ 6 then use LU, otherwise do m - 5 reductions, solve the remaining system by LU, and back substitute. This reduces the time for OER to 106.5N + 14839m - 46903.5, and
for N = 1023 it represents a savings of 13%.
We also note that vector startups are the dominant term for OER when
N ≤ 1532, and are the dominant term for OEE when N ≤ 137. Of course, OEE
uses O(N log N) operations vs. O(N) for the other methods and soon becomes
inefficient. Nevertheless, it is faster than LU for 455 ≤ N ≤ 1023,
although it is always slower than OER.
Our conclusion is that the OER-LU polyalgorithm is the best direct
method of those considered here.
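The timing formulas above can be compared directly; a sketch (the function names are ours):

```python
import math

# Initial execution-time models quoted in the text (n = 2 blocks).
def t_lu(N):
    return 1012 * N - 784

def t_oee(N):
    m = math.ceil(math.log2(N))
    return 94.5 * N * m + 12999 * m - 93.5 * N - 1901

def t_oer(N):
    m = math.ceil(math.log2(N))
    return 106.5 * N + 14839 * m - 14717.5

def t_poly(N):
    # OER-LU polyalgorithm: m - 5 reductions, then LU on the rest.
    m = math.ceil(math.log2(N))
    return 106.5 * N + 14839 * m - 46903.5

ratios = (t_oer(1023) / t_lu(1023),
          t_oer(8191) / t_lu(8191),
          t_oer(1023) / t_oee(1023))
poly_savings = 1 - t_poly(1023) / t_oer(1023)
```

Evaluating these reproduces the ratio table and the 13% polyalgorithm savings quoted above.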
Now consider the cost of individual eliminations or reductions with
respect to the total cost of the algorithm. For odd-even elimination,
each elimination step has roughly equal cost, the last being slightly
cheaper. Thus each step costs about 1/m of the total cost. For odd-even
reduction, the cost per reduction decreases as the algorithm proceeds:
  cost of the kth reduction as a fraction of total cost

                     k = 1     2     3     4     5     6    total cost
  N = 1023, m = 10    29%    17%   12%    9%    7%    7%       242,622
  N = 8191, m = 13    43%    22%   12%    7%    4%    3%     1,050,531

  (cost for kth red. = 106.5(2^{m-k}) + 14839 when N = 2^m - 1)
For small N, when startup costs are dominant, reduction costs are approximately equal, while for large N, when true arithmetic costs are dominant,
the kth reduction costs about 1/2^k of the whole. This has immediate consequences for the effectiveness of incomplete reduction, since the omitted
reductions are only a small part of the total cost when N is large.
For our comparison of semidirect and iterative methods, the goal will
be to approximate x with a relative error of 1/N, or to reduce the initial
error by a factor of 1/N. Thus if we perform either k odd-even eliminations or k reductions, we want || B[H^(k+1)] || ≤ 1/N or || B[A^(k+1)] || ≤ 1/N,
assuming || B[A] || < 1. Theorem 7.1 is then applied to produce the desired
approximation by incomplete elimination or reduction. Taking θ = || B[A] ||,
we have || B[H^(k+1)] ||, || B[A^(k+1)] || ≤ θ^{2^k}, and demanding θ^{2^k} ≤ 1/N yields

  k ≥ log₂((log₂N)/(-log₂θ)).

The execution time for k eliminations is

  94.5Nk + 12999k + 10.5N - 105(2^k) + 1205,

and the time for k reductions is
  106.5N + 14839k - 96(2^{m-k}) + 1287.

Typical savings for incomplete reduction over complete reduction would be:
  θ = .9,  pick k ≥ log₂log₂N + 2.718:
      N = 1023, k = 7, savings ≈ 12%
      N = 8191, k = 7, savings ≈ 7%
  θ = 2/3, pick k ≥ log₂log₂N + .774:
      N = 1023, k = 5, savings ≈ 25%
      N = 8191, k = 5, savings ≈ 12%
  θ = 1/2, pick k ≥ log₂log₂N:
      N = 1023, k = 4, savings ≈ 33%
      N = 8191, k = 4, savings ≈ 16%
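The rule for choosing k can be sketched as (the function name is ours):

```python
import math

def k_steps(N, theta):
    """Smallest k with theta^(2^k) <= 1/N, i.e.
    k = ceil( log2( log2(N) / (-log2(theta)) ) )."""
    return math.ceil(math.log2(math.log2(N) / (-math.log2(theta))))

ks_1023 = [k_steps(1023, t) for t in (0.9, 2/3, 0.5)]
ks_8191 = [k_steps(8191, t) for t in (0.9, 2/3, 0.5)]
```

Both system sizes give the same counts, reflecting the doubly logarithmic dependence on N.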
Comparison to the OER-LU polyalgorithm decreases the savings.

We next consider iterative methods and their combination with semidirect methods. One step of the block Jacobi iteration costs 22.5N + 3031,
while one step of the block Jacobi semi-iteration costs 26.5N + 3408 plus
overhead to estimate parameters such as ρ = ρ(B[A]). Execution time may
be reduced to 12N + 1840 (16N + 2217) per iteration by directly computing
B[A] and D[A]^{-1}v (cost = 33.5N + 4367 ≈ 3 iterations).

We will use the Jacobi iteration in its simplest form when only a few
steps are demanded, and Jacobi semi-iteration when many steps are needed.
The number of iterations needed by J-SI to reduce the initial error by a
factor of 1/N are estimated to be
The LU factorization is always faster than OEE, and is faster than
OER for N ≤ 312. Vector startup costs are dominant in OEE for N ≤ 152,
and in OER for N ≤ 798. The OER-LU polyalgorithm should perform m - 6 reductions, solve the remaining system by LU, and back substitute. The time is
183N + 16233m - 70677, yielding a 16% saving over OER for N = 1023 but only
a 3% saving for N = 8191.
The execution time for odd-even elimination is allocated as follows:

                     N = 1023   N = 8191
  arithmetic            96%        96%
  overhead               4%         4%
  results               87%        98%
  vector startups       13%         2%
Analysis of the total time for odd-even reduction shows the greater importance of overhead operations:

  N = 1023         results   startups   total
    arithmetic       33%        40%       73%
    overhead         23%         4%       27%
    total            56%        44%

  N = 8191         results   startups   total
    arithmetic       52%        11%       62%
    overhead         37%         1%       38%
    total            88%        12%
  limit, N → ∞     results   startups   total
    arithmetic       58%        -         58%
    overhead         42%        -         42%
    total           100%        -
The increase in time for OER due to overhead is 37% for N = 1023, 61% for
N = 8191, and 72% in the limit as N → ∞. As a consequence of the data
layout of Section 10.B, overhead startup costs are quite small, but the
overhead result costs cannot be ignored and probably cannot be decreased.
The strong requirement that machine vectors consist of contiguous storage
locations leads to much "wasted" effort in execution time.
Most of the execution time of OER is actually spent in the arithmetic
and vector compress operations:

  cost of individual operation as a fraction of total cost

                  N = 1023   N = 8191   limit, N → ∞
  divide             12%        12%        12%
  multiply           43%        35%        32%
  add/subtract       18%        15%        14%
  compress           20%        29%        33%
  merge               4%         6%         7%
  transmit            3%         3%         3%

The compress operation assumes more importance as N increases, again
reflecting its low startup costs. As noted earlier, most of OER's time
is spent in the first few reductions and subsequent back substitutions.
  ρ = .9,  1.5m iterations,
  ρ = 2/3, .75m iterations,
  ρ = 1/2, .5m iterations.

For ρ = 2/3, for example, J-SI is never faster than OER.
As suggested in Section 7.A, we can combine incomplete elimination or
reduction with Jacobi iterations. If k steps of elimination or reduction
are used, followed by ℓ iterations and back substitution, k and ℓ should
be chosen so that θ^{ℓ2^k} ≤ 1/N, or

(*)  k + log₂ℓ ≥ log₂((log₂N)/(-log₂θ)).

The time for the OER-J method would then be

  106.5N + 14839k - 96(2^{m-k}) + 1287 + ℓ(19(2^{m-k}) + 2615),

and it only remains to minimize the time subject to (*). For θ = 2/3,
N = 1023, the resulting values are k = 2, ℓ = 5, which represent a savings
of 37% over OER. For θ = 2/3, N = 8191, the resulting values are k = 3,
ℓ = 3, which represent a savings of 15% over OER.
Of course, to determine k and ℓ we need to know β. While it is true
that β = ||B[A]|| can be computed directly from A at a cost of 46.5N + 3881,
or from B[A] at a cost of 15N + 308, we recognize that half the rows of
B[A] are available as part of the first odd-even reduction. An estimate
of ||B[A]|| can be computed from these rows at a cost of 7.5N + 308. For
large N even this cost considerably reduces any benefits gained from using
incomplete reduction, and points out the importance of an analytic estimate
of ||B[A]|| rather than a computational estimate.
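The minimization subject to (*) is easy to mechanize. The following Python sketch (function names are ours, not the thesis's; the time formula is the n = 2, STAR-data estimate quoted above) enumerates the feasible (k, ℓ) pairs and picks the cheapest:

```python
import math

def oer_j_time(N, m, k, l):
    # OER-J time estimate for n = 2 with STAR data, from the text:
    #   106.5N + 14839k - 96(2^(m-k)) + 1287 + l(19(2^(m-k)) + 2615)
    r = 2 ** (m - k)
    return 106.5 * N + 14839 * k - 96 * r + 1287 + l * (19 * r + 2615)

def satisfies_star(k, l, N, beta):
    # Condition (*): k + log2(l) >= log2(log2(N) / (-log2(beta)))
    return k + math.log2(l) >= math.log2(math.log2(N) / (-math.log2(beta)))

def best_k_l(N, m, beta, l_max=16):
    # Exhaustively minimize the time estimate subject to (*).
    best = None
    for k in range(1, m + 1):
        for l in range(1, l_max + 1):
            if satisfies_star(k, l, N, beta):
                t = oer_j_time(N, m, k, l)
                if best is None or t < best[0]:
                    best = (t, k, l)
    return best
```

For β = 2/3 this reproduces the pairs quoted in the text: k = 2, ℓ = 5 at N = 1023 and k = 3, ℓ = 3 at N = 8191.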
This ends our initial comparison of algorithms. We now make the
assumption that vector operands are consecutive storage locations, allow
overlap of scalar operations, and consider the cost of loop control and
scalar load/store operations. The net effect is to lower the time esti-
mate for the LU factorization and raise the time estimate for odd-even
reduction. Odd-even elimination remains substantially unchanged. Rather
than simply recompute the figures presented earlier, which will be done in
a few cases, we try to indicate where the algorithms spend most of their
time.

Revised time estimates are given below. These should give a better
indication of actual runtime on STAR, but are probably all underestimates;
we note changes from the first set of timings, m = ⌈log_2 N⌉.
1. LU factorization (LU):       600N - 150
   (a 40% decrease due to use of instruction overlap)
2. Odd-even elimination (OEE):  98.5Nm + 13403m - 101.5N - 2261
   (a 4% increase due to overhead costs)
3. Odd-even reduction (OER):    183N + 16233m - 16108
   (a 10-70% increase due to overhead costs; larger N incurs
   larger increases).
Time ratios show the effect of decreasing the LU time and increasing the
OER time:

                  N = 1023    N = 8191    limit, N → ∞
     OER:LU         .54         .34          .31
     OER:OEE        .32         .17          0.
     cost of ith reduction as fraction of total cost

                    i =  1     2     3     4     5     6
     N = 1023           33%   19%   12%    8%    7%    6%
     N = 8191           45%   23%   12%    6%    3%    2%
     limit, N → ∞       50%   25%   13%    6%    3%    2%

Of the total time, 86% is spent in the reduction phase and 14% in the back
substitution.
One of the effects of the OER-LU polyalgorithm is to reduce the impor-
tance of vector startup costs, as seen below.

                     N = 1023    N = 8191
     scalar time       14%          2%
     vector time       86%         98%
       results         63%         91%
       startups        23%          7%
Since vector startup costs are already small, not much more can be saved
this way. The tables below give the savings for incomplete reduction and
incomplete reduction + Jacobi iteration (β = ||B[A]||).
     Incomplete reduction, savings over OER

                  N = 1023       N = 8191
                  k   savings    k   savings
     β = .9       7    10%       7     5%
     β = 2/3      5    21%       5     9%
     β = 1/3      4    27%       4    13%
     Incomplete reduction + Jacobi iteration, savings over OER

                  N = 1023            N = 8191
                  k   ℓ   savings     k   ℓ   savings
     β = .9       3   9    22½%       4   6     8½%
     β = 2/3      2   5    36½%       2   6    13½%
     β = 1/3      1   5    47½%       1   5    29½%
As seen earlier, the semidirect methods are best applied when the omitted
odd-even reductions make up a significant part of the OER time. There is
considerable advantage to using incomplete reduction plus Jacobi iterations,
roughly doubling the savings obtained by simple incomplete reduction.
APPENDIX A
VECTOR COMPUTER INSTRUCTIONS
This section describes those features of the CDC STAR-100 needed for
our comparison of algorithms in Section 10. Information given here is
derived from CDC reference manuals and timing measurements made at Lawrence
Livermore Laboratory [C2], and by no means represents a complete discussion
of the STAR's facilities.

Execution times are for full-word (64 bit) floating point instructions,
given as the number of 40 nanosecond cycles needed for each instruction.
All times should be accurate to within 5% except as noted, though actual
performance may be degraded somewhat due to memory conflicts.

A vector on STAR consists of N consecutive storage locations,
1 ≤ N ≤ 65,536. The vector length restriction is usually not serious, but
the need for consecutive storage is crucial. Other vector computers, the
TI ASC in particular, do not have this restriction. Both machines are
pipeline computers with the usual scalar instructions as well as vector
instructions.
Useful features of vector instructions on STAR include:
broadcast constants - Most vector instructions allow a scalar (the broad-
cast constant, held in a register) to be transmitted repeatedly to form
one or both of the vector operands. Use of this facility decreases exe-
cution time in a few cases.
control vectors - Bit vectors can modify the effect of vector instructions,
such as suppressing storage to selected components of the result vector
or determining which operand components are to be used in an operation.
Use in vector data manipulation (compress and merge) is explained
later.
offset - Modification to the base address of operands.
sign control - Certain vector operations allow the sign of the operands
to be modified before execution.
Individual classes of instructions follow.
scalar (register) instructions. Issue time is the number of cycles required
to initiate execution. Shortstop time is the time after issue when the
result is first available, but not yet placed in a register. The result
must be used at the precise time given; if not, the result cannot be used
until it is placed in a register. Result-to-register time is the number
of cycles after issue when the result is placed in a register. Overlapped
execution is possible, but the exact effect on execution time is best
determined by actual testing, which we have not done.
     STAR                                                  result to
     instruction   operation              issue  shortstop  register
     62            add, normalized          2        6         11
     66            subtract, normalized     2        6         11
     6B            multiply, significant    2       10         15
     6F            divide, significant      3        -         43
     73            square root              3        -         71
     78            register transmit        2        4          9
scalar (register) load/store instructions. Up to three load/store instruc-
tions may be stacked in the load/store unit at any given time. The stack
is processed sequentially at a rate of 19 cycles per load and 16 cycles
per store minimum, assuming no memory conflicts. Once a memory bank has
been referenced it remains busy for 31 cycles; during this time further
references to the bank are delayed and the conflict will affect references
to other memory banks. Start to finish for a single load requires at
least 31 cycles, and a minimum of 66 cycles are required to store an item
and reload it.
     STAR                               result        load/store
     instruction   operation   issue   to register    unit busy
     7E            load          3         28            19
     7F            store         3         -             16
branch instructions. Execution time for a branch can range from 8 to 34
cycles, depending on the particular instruction, whether a branch is actu-
ally made, whether the target instruction is in the instruction stack, etc.
We will be only slightly pessimistic if we take 40 cycles to be the time
for a typical branch.

vector operation instructions. Execution time consists of an initial start-
up time followed by sequential appearance of result vector components, or
startup time followed by processing time of operands if a scalar result is
generated. Vector instructions may not be overlapped except for a negligible
part of the startup time. In the table below, N is the length of the input
vector(s) and V is the number of omitted items due to use of a control vec-
tor. Times for instructions D8 - DC may vary as much as ±20%.
     STAR
     instruction   operation              time              effect
     82            add, normalized        N/2 + 71          c_i ← a_i + b_i
     86            subtract, normalized   N/2 + 71          c_i ← a_i - b_i
     8B            multiply, significant  N + 159           c_i ← a_i × b_i
     8F            divide, significant    2N + 167          c_i ← a_i ÷ b_i
     93            square root            2N + 155          c_i ← √a_i
     98            transmit               N/2 + 91          c_i ← a_i
     99            absolute value         N/2 + 91          c_i ← |a_i|
     D8            maximum                6N - 2V + 95      c ← max_{i=1..N} a_i
     DA            summation              6.5N - 2V + 122   c ← Σ_{i=1..N} a_i
     DB            product                6N - V + 113      c ← Π_{i=1..N} a_i
     DC            inner product          6N - V + 130      c ← Σ_{i=1..N} a_i b_i
vector data manipulation instructions. The compress instruction transmits
selected components of one vector into a shorter result vector. Given a
vector a and a control vector z, the instruction essentially executes the
code

     j ← 1
     for i = 1, 2, ..., N
          if z_i = 1 then {c_j ← a_i; j ← j+1}.

The result vector c has length bitsum(z). The merge instruction "shuffles"
two vectors into a third, according to the control vector:

     length(c) = length(z) = length(a) + length(b)
     j ← 1; k ← 1
     for i = 1, 2, ..., length(c)
          if z_i = 1 then {c_i ← a_j; j ← j+1}
          else {c_i ← b_k; k ← k+1}.
In the table below, N_1 = length(a), N_2 = length(c), R = number of input
items broadcast from register.

     STAR
     instruction   operation              time
     BC            compress a → c per z   N_1/2 + 3N_2 + 92
     BD            merge a,b → c per z    3N_2 - 2R + 123

Execution times for these instructions may vary as much as ±20%.
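In current notation, the effect of the two instructions can be sketched directly from the pseudocode above (a Python illustration only, not STAR code):

```python
def compress(a, z):
    # BC: transmit a[i] to the next slot of c wherever the control
    # bit z[i] is 1; the result has length bitsum(z).
    return [ai for ai, zi in zip(a, z) if zi == 1]

def merge(a, b, z):
    # BD: c[i] takes the next component of a where z[i] = 1 and the
    # next component of b where z[i] = 0; len(z) = len(a) + len(b).
    c, ia, ib = [], iter(a), iter(b)
    for zi in z:
        c.append(next(ia) if zi == 1 else next(ib))
    return c
```

With the control vector z = (1 0 1 0 ...) used by odd-even reduction, compress extracts the odd-numbered components and merge interleaves them back.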
APPENDIX B
SUMMARY OF ALGORITHMS AND TIMINGS
We collect the algorithms discussed in the main text, and present
timing formulae in general terms for a vector computer. This is then
specialized to block tridiagonal systems with 2 × 2 blocks using timing
information for the CDC STAR-100. A comparison of methods based on the
latter set of timings is given in Section 10.C. It must be emphasized
that our time estimates have not been verified by direct testing on STAR.

The algorithms solve Ax = v, A = (a_j, b_j, c_j)_1^N, with blocks of uni-
form size n × n. The basic arithmetic operations are (S = add or sub-
tract, M = multiply, D = divide):
1. factor an n × n block b
        cost = (1/2)(n^2 - n)D + (1/6)(2n^3 - 3n^2 + n)(M + S)
2. having factored b, solve bg = f, where f is n × n'
        cost = n'(nD + (n^2 - n)(M + S))
3. product of n × n and n × n' blocks
        cost = n'(n^2 M + (n^2 - n)S)
4. difference of two n × n' blocks
        cost = n'(nS)
Depending on context, the symbols S, M, D can mean either a single
scalar operation or a vector operation with vector length specified by
the algorithm.
For each algorithm we give a code in terms of blocks of A, since the
translation to individual scalar or vector operations is generally obvious.
In those cases where additional detail seems to be necessary it is pro-
vided. Two basic sets of timings are given, five formulae in all for
each algorithm:

1. counting arithmetic operations only, assuming machine vectors
   can be any sequence of storage locations in arithmetic progres-
   sion, and ignoring pivoting, instruction overlap, address cal-
   culation, loop control, termination tests for iterations, load/
   store costs for scalar operands, and any other forms of storage
   management
   (a) for systems with n × n blocks, in general terms for a
       vector computer
   (b) for systems with n × n blocks, with timing data for
       the CDC STAR-100
   (c) for systems with 2 × 2 blocks, with STAR data
2. a more realistic model of STAR, requiring machine vectors to be
   contiguous storage locations, allowing overlap of scalar opera-
   tions, counting storage management and load/store costs, and
   loop control to some extent, but still ignoring pivoting, ad-
   dress calculations and termination tests for iterations
   (a) for systems with n × n blocks, with STAR data
   (b) for systems with 2 × 2 blocks, with STAR data.

Additional notes are provided in many cases. Timings 1(c) and 2(b) are
used in Section 10.C.
A few remarks on the timings are necessary. Since we eventually want
estimates for 2 × 2 blocks, the only loop control that will be counted is
that on the main loops. Operations on n × n matrices or n-vectors are
assumed to be expanded into straight-line code.

The address calculations that have been ignored here would be exe-
cuted in scalar mode, and probably can be completely overlapped with
scalar arithmetic operations in the LU factorization. Their effect on
the odd-even methods would be to increase the O(m) terms, m = ⌈log_2 N⌉, but
this probably represents only a small relative increase since the O(m)
terms consist of vector startup costs and are already large. While scalar
operations can sometimes destroy the efficiency of a vector code, we be-
lieve that our basic conclusions will not be altered by including address
calculations.
1. Block LU factorization, using scalar operations mostly

Algorithm:

     d_1 = b_1; f_1 = v_1
     for i = 2, ..., N
          solve d_{i-1}(u_{i-1} g_{i-1}) = (c_{i-1} f_{i-1})
          d_i = b_i - a_i u_{i-1}
          f_i = v_i - a_i g_{i-1}
     solve d_N x_N = f_N
     for i = N-1, ..., 1
          x_i = g_i - u_i x_{i+1}
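A direct transcription of the algorithm into Python with NumPy (dense blocks, no pivoting; a sketch for exposition, not the STAR code):

```python
import numpy as np

def block_lu_solve(a, b, c, v):
    # Block LU solve for the block tridiagonal system (a_j, b_j, c_j),
    # following the recurrences above (0-based indexing).
    # a[0] and c[N-1] are never referenced.
    N = len(b)
    u, g = [None] * N, [None] * N
    d, f = b[0], v[0]
    for i in range(1, N):
        u[i - 1] = np.linalg.solve(d, c[i - 1])   # d_{i-1} u_{i-1} = c_{i-1}
        g[i - 1] = np.linalg.solve(d, f)          # d_{i-1} g_{i-1} = f_{i-1}
        d = b[i] - a[i] @ u[i - 1]
        f = v[i] - a[i] @ g[i - 1]
    x = [None] * N
    x[N - 1] = np.linalg.solve(d, f)              # d_N x_N = f_N
    for i in range(N - 2, -1, -1):                # back substitution
        x[i] = g[i] - u[i] @ x[i + 1]
    return x
```

Here a[i], b[i], c[i] are the sub-, main- and super-diagonal blocks of block row i.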
Storage handling:

If desired we can overwrite d_i -> b_i, l_i = a_i d_{i-1}^{-1} -> a_i,
u_i -> c_i, x_i -> g_i -> f_i -> v_i. The factorization loop needs to save
either d_i and f_i or u_i and g_i, a total of n^2 + n stores and subsequent
loads. If d_i, f_i are saved then the second loop would be

     for i = N-1, ..., 1
          solve d_i x_i = f_i - c_i x_{i+1},

with a corresponding increase in execution time.

Variations of the basic algorithm, with estimate of minimum runtime
due to the load/store unit:

Version 1, a one-time solution.

     initialization loads b_1, v_1
     factorization loop loads c_{i-1}, a_i, b_i, v_i
                        stores u_{i-1}, g_{i-1}
     initialization stores x_N
     back substitution loop loads u_i, g_i
                            stores x_i

At 19 cycles per load and 16 per store,

     total time ≥ {19(4n^2 + 2n) + 16(n^2 + 2n)}(N-1) + 19(n^2 + n) + 16n
                = (92n^2 + 70n)(N-1) + 19n^2 + 35n

for n = 2, total time ≥ 508N - 362
Version 2, factor + solve, A = (l_i, I, 0)(0, d_i, 0)(0, I, u_i).

Factorization requires n^2(3N - 2) loads for A.

Version a. factorization loop stores d_i only, and computes
l_i = a_i d_{i-1}^{-1}, u_i = d_i^{-1} c_i by vector operations,
vector length = N.

     total time ≥ 19n^2(3N - 2) + 16n^2 N + (1/2)(3n^2 - n)D
                  + (1/6)(14n^3 - 15n^2 + n)(M + S),
     where D = τ_D N + σ_D, etc.
     for n = 2, total time ≥ 323.5N + 3421

Version b. factorization loop stores d_i, u_i, computes l_i by vector
operations, vector length = N.

     total time ≥ 19n^2(3N - 2) + 16n^2(2N - 1)
                  + (1/2)(2n^2 - n)D + (1/6)(8n^3 - 9n^2 + n)(M + S)
     for n = 2, total time ≥ 373.5N + 1769

Version c. factorization loop stores d_i, u_i, l_i.

     total time ≥ 19n^2(3N - 2) + 16n^2(3N - 2)
     for n = 2, total time ≥ 420N - 280

Solution of Ax = v, given the factorization of A.

     forward elimination loads v_1 initially
          loop loads l_i, v_i, stores f_i
     solve d_j g_j = f_j with vector operations, vector length = N
     back substitution loads g_N initially
          loop loads u_i, g_i, stores x_i

     total time ≥ 19(2n^2(N-1) + 2nN) + 16(2nN)
                  + (1/2)(n^2 + n)D + (1/6)(2n^3 + 3n^2 - 5n)(M + S)
     for n = 2, total time ≥ 302.5N + 1039
Timings:

1. Version 1, for initial comparison, arithmetic operations without
   overlap, load/stores ignored.

   (a) {(1/2)(3n^2 + n)t_D + (1/6)(14n^3 + 9n^2 - 5n)(t_M + t_S)}N
       - {n^2 t_D + (2n^3 + n^2)(t_M + t_S)}
   (b) with STAR data: (70n^3 + 114n^2 - 2n)N - (60n^3 + 76n^2)
   (c) for n = 2, with STAR data: 1012N - 784

Version 1, for second comparison, allowing overlap and including
load/stores.

   In the absence of facilities for direct testing on STAR, a scalar
   code was written for the special case n = 2 and time estimates
   computed from that code. The details are tedious and well-illustrated
   by the simpler tridiagonal code in [L2], so they will not be given
   here. The main time estimates are

        initialize - loads + loop set up        200 cycles
        factorization loop                      400(N-1)
        initialize back substitution            250
        back substitution loop                  200(N-1)
        total time estimated to be              600N - 150 cycles

   This timing is probably optimistic due to unforeseen memory and
   overlap conflicts.
Notes:

1. Gaussian elimination, taking advantage of the block structure, using
   scalar operations, requires the same execution time as the block LU
   factorization.

2. If u_j, 1 ≤ j ≤ N-1, is saved in the factorization, then ||B[A]|| may
   be computed following the first loop at a cost of

        ((n-1)τ_S + τ_max)(nN) + (n-1)σ_S + σ_max

   using the vector maximum instruction. This yields 13N + 166 for n = 2
   on STAR.
2. Odd-even elimination, using vector operations

We give the algorithm and timing for N = 2^m - 1, deriving formulae
solely in terms of N and m. Timing for other values of N is approximated
by setting m = ⌈log_2 N⌉.
Algorithm: a_j^{(1)} = a_j, etc.

     for i = 1, 2, ..., m   (r = 2^{i-1})
          for each j = 1, 2, ..., N
               solve b_j^{(i)}(α_j γ_j φ_j) = (a_j^{(i)} c_j^{(i)} v_j^{(i)})      length N
                    (take α_j = γ_j = 0, φ_j = 0 if j < 1 or j > N)
               b_j^{(i+1)} = b_j^{(i)} - a_j^{(i)} γ_{j-r} - c_j^{(i)} α_{j+r}     length N-r
               v_j^{(i+1)} = v_j^{(i)} - a_j^{(i)} φ_{j-r} - c_j^{(i)} φ_{j+r}     length N-r
               a_j^{(i+1)} = -a_j^{(i)} α_{j-r}                                    length N-2r
               c_j^{(i+1)} = -c_j^{(i)} γ_{j+r}                                    length N-2r
     for each j = 1, 2, ..., N
          solve b_j^{(m+1)} x_j = v_j^{(m+1)}
Storage handling: temporary vectors are needed for the factorization of
b_j^{(i)} and for α_j, γ_j, φ_j computed at each stage, and we can overwrite
b_j^{(i+1)} -> b_j^{(i)}, a_j^{(i+1)} -> a_j^{(i)}, etc.
Timing:

1. arithmetic only

   (a) {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 3n^2 - 5n)τ_M
        + (1/6)(38n^3 - 9n^2 - 5n)τ_S}Nm
       + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 3n^2 - 5n)σ_M
        + (1/6)(38n^3 - 9n^2 - 5n)σ_S}m
       - {(1/2)(3n^2 - n)τ_D + (1/6)(46n^3 - 3n^2 + 5n)τ_M
        + (1/6)(46n^3 - 27n^2 + 5n)τ_S}N
       + {2n^3 τ_M + (2n^3 - 2n^2)τ_S + (1/2)(n^2 + n)σ_D
        - (1/6)(10n^3 - 3n^2 + 5n)σ_M - (1/6)(10n^3 - 15n^2 + 5n)σ_S}

   (b) with STAR data,
       (1/4)(38n^3 + 19n^2 - n)Nm + (1/6)(8740n^3 + 2343n^2 - 649n)m
       - (1/4)(46n^3 + n^2 + n)N - (1/6)(2282n^3 - 2037n^2 + 649n)

   (c) for n = 2, with STAR data,
       94.5Nm + 12999m - 93.5N - 1901.
2. With overwriting of a_j^{(i+1)} -> a_j^{(i)}, c_j^{(i+1)} -> c_j^{(i)}, it is
   necessary to develop the products one row at a time in a temporary
   vector, and then use a transmit operation on the whole row. For
   specifics see the discussion of odd-even reduction to follow. The
   added cost for loop control and transmit instructions is

       40m + Σ_{i=1}^{m-1} 2n((1/2)n(N - 2^i) + 91)
            = 40m + n^2(N(m-2) + 1) + 182n(m-1),

   or 4Nm + 404m - 8N - 360 for n = 2.

   (a) with STAR data, total time is
       (1/4)(38n^3 + 23n^2 - n)Nm + (1/6)(8740n^3 + 2343n^2 + 443n + 240)m
       - (1/4)(46n^3 + 9n^2 + n)N - (1/6)(2282n^3 - 2043n^2 + 1741n)

   (b) for n = 2, with STAR data, total time is
       98.5Nm + 13403m - 101.5N - 2261.
145
Notes :
1. || B[ A ] 1 1 may be computed in the first time through the loop at a cost
of ((2n-l)T, + T ) (nN) + ((2n-l)a, + c ). This yields 15N 4- 308
for n = 2 on STAR. || B [ H ^ ] j | may be computed at a cost of 15(N - 2m 1)
+ 303 at the appropriate time through the loop.
3. Odd-even reduction, using vector operations

As with odd-even elimination, the algorithm and timings are derived
for N = 2^m - 1. For general N take m = ⌈log_2 N⌉ in the timing formulae,
even though m = ⌈log_2(N+1)⌉ is the proper choice. The algorithm is given in
detailed form, using the data layout and storage handling described in Sec-
tion 10.B. For the basic form, see Section 4.D.
Algorithm: A^{(1)} = A, w^{(1)} = v, N* = 2^m

set up control vector z = (1 0 1 0 ...), length n^2 N* bits
     cost not counted in timing, but = one instruction on STAR

for i = 1, 2, ..., m-1   (N_i = 2^{m-i+1} - 1, N_i* = 2^{m-i+1})

     odd-even separate A^{(i)}, w^{(i)}
          8 compress operations, cost = (17/8)(3n^2 + n)N_i* + 736

     factor b_{2j+1}^{(i)}, (0 ≤ j ≤ N_{i+1})
          factorization overwrites b_{2j+1}^{(i)}
          need a temporary vector, length N_{i+1} + 1
          cost = (1/2)(n^2 - n)D + (1/6)(2n^3 - 3n^2 + n)(M + S),
          vector lengths = N_{i+1} + 1

     solve b_{2j+1}^{(i)}(α_{2j+1}^{(i)} γ_{2j+1}^{(i)} φ_{2j+1}^{(i)})
               = (a_{2j+1}^{(i)} c_{2j+1}^{(i)} w_{2j+1}^{(i)}), (0 ≤ j ≤ N_{i+1})
          overwrites a_{2j+1}^{(i)}, etc.
          need a temporary vector, length N_{i+1} + 1
          cost = (2n + 1)(nD + (n^2 - n)(M + S)),
          vector lengths = N_{i+1} + 1

     b_j^{(i+1)} = b_{2j}^{(i)} - a_{2j}^{(i)} γ_{2j-1}^{(i)} - c_{2j}^{(i)} α_{2j+1}^{(i)}, (1 ≤ j ≤ N_{i+1})
          overwrites b_{2j}^{(i)}
          need a temporary vector, length N_{i+1}
          cost = 2n^3(M + S)
          for simplicity, take vector lengths = N_i*

     w_j^{(i+1)} = w_{2j}^{(i)} - a_{2j}^{(i)} φ_{2j-1}^{(i)} - c_{2j}^{(i)} φ_{2j+1}^{(i)}, (1 ≤ j ≤ N_{i+1})
          overwrites w_{2j}^{(i)}
          need a temporary vector, length N_{i+1}
          cost = 2n^2(M + S)
          for simplicity, take vector lengths = N_i*

     a_j^{(i+1)} = -a_{2j}^{(i)} α_{2j-1}^{(i)}, (2 ≤ j ≤ N_{i+1})
          overwrites a_{2j}^{(i)}; product developed one row at a time:
          for k = 1, ..., n
               for l = 1, ..., n
                    t_j = a_{k1,2j}^{(i)} α_{1l,2j-1}^{(i)}
                    for r = 2, ..., n
                         t_j = t_j + a_{kr,2j}^{(i)} α_{rl,2j-1}^{(i)}, (1 ≤ j ≤ N_{i+1})
                    a_{kl,2j}^{(i)} = -t_j, (1 ≤ l ≤ n; 1 ≤ j ≤ N_{i+1})
          cost = n^3 M + (n^3 - n^2)S, vector length = N_{i+1},
               + n transmits, vector length = nN_{i+1}*
          need two temporary vectors, length = nN_{i+1} and N_{i+1}

     c_j^{(i+1)} = -c_{2j}^{(i)} γ_{2j+1}^{(i)}, (1 ≤ j ≤ N_{i+1} - 1)
          as for a_j^{(i+1)}

solve b_1^{(m)} x_1^{(m)} = w_1^{(m)}, scalar mode
     cost = (1/2)(n^2 + n)D + (1/6)(2n^3 + 3n^2 - 5n)(M + S)

for i = m-1, ..., 1
     x_{2j+1}^{(i)} = φ_{2j+1}^{(i)} - α_{2j+1}^{(i)} x_j^{(i+1)} - γ_{2j+1}^{(i)} x_{j+1}^{(i+1)}, (0 ≤ j ≤ N_{i+1})
          overwrites φ_{2j+1}^{(i)}
          cost = 2n^2(M + S), vector length = N_{i+1} + 1

     x_j^{(i+1)} -> temporary
          cost = transmit, vector length = nN_{i+1}*

     merge x_{2j+1}^{(i)}, temporary -> x^{(i)}
          cost = 3nN_i* + 123
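Stripped of the STAR data layout, the reduction and back substitution amount to the following recursive sketch (Python with NumPy, scalar form; 0-based indexing, so the text's odd-numbered equations are the even indices here):

```python
import numpy as np

def odd_even_reduction(a, b, c, v):
    # Recursive sketch for N = 2^m - 1 blocks.  Even indices (the text's
    # odd-numbered equations) are eliminated; the reduced system lives on
    # the odd indices.  a[0] and c[N-1] are dummies.
    N = len(b)
    if N == 1:
        return [np.linalg.solve(b[0], v[0])]
    alpha = [np.linalg.solve(b[j], a[j]) for j in range(0, N, 2)]
    gamma = [np.linalg.solve(b[j], c[j]) for j in range(0, N, 2)]
    phi   = [np.linalg.solve(b[j], v[j]) for j in range(0, N, 2)]
    ra, rb, rc, rv = [], [], [], []
    for k, j in enumerate(range(1, N, 2)):       # reduced equation k
        rb.append(b[j] - a[j] @ gamma[k] - c[j] @ alpha[k + 1])
        rv.append(v[j] - a[j] @ phi[k] - c[j] @ phi[k + 1])
        ra.append(-(a[j] @ alpha[k]))            # dummy for k = 0
        rc.append(-(c[j] @ gamma[k + 1]))        # dummy for the last k
    y = odd_even_reduction(ra, rb, rc, rv)
    # back substitution into the eliminated equations, then interleave
    x = []
    for k in range(len(phi)):
        xk = phi[k].copy()
        if k > 0:
            xk = xk - alpha[k] @ y[k - 1]
        if k < len(y):
            xk = xk - gamma[k] @ y[k]
        x.append(xk)
        if k < len(y):
            x.append(y[k])
    return x
```

The recursion depth is m - 1, matching the reduction loop above; the compress/merge traffic of the STAR version corresponds to the index splitting and interleaving done here with Python lists.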
Timings:

1. arithmetic only,

   (a) {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M
        + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N-1)
       + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M
        + (1/6)(38n^3 + 3n^2 - 5n)σ_S}(m-1)
       + {(1/2)(n^2 + n)t_D + (1/6)(2n^3 + 3n^2 - 5n)(t_M + t_S)}

   (b) with STAR data: (1/4)(38n^3 + 31n^2 - n)(N-1)
       + (1/6)(8740n^3 + 5103n^2 - 649n)(m-1) + (10n^3 + 38n^2 - 2n)

   (c) for n = 2, with STAR data: 106.5N + 14839m - 14717.5

2. with loop control and storage manipulation (compress, merge, transmit),
   using STAR data,

   (a) for general n, add 80m + (1/4)(55n^2 + 43n)(N-1) + (182n + 950)(m-1)

   (b) for n = 2, add 76.5N + 1394m - 1390.5, so
       total time = 183N + 16233m - 16108
Notes:

1. An estimate for ||B[A]|| may be computed in the first time through the
   first loop at a cost of ((2n-1)τ_S + τ_max)(nN/2) + ((2n-1)σ_S + σ_max).
   This yields 7.5N + 308 for n = 2 on STAR. The estimate of ||B[A]|| is
   simply the maximum row sum taken over every other block row. The true
   value of ||B[A]|| could be found by computing the remaining rows of B[A],
   but this is more expensive and defeats the purpose of odd-even reduction.
4. p-fold reduction

The algorithm and timings are given for N = p^m - 1, p ≥ 3. For general
N, take m = ⌈log_p N⌉ = ⌈(log_2 N)/(log_2 p)⌉, though m = ⌈log_p(N+1)⌉ is
actually the proper choice.

Algorithm: (see Section 6.A for details), N_i = p^{m-i+1} - 1

     for i = 1, ..., m-1
          for each k = 1, ..., N_{i+1} + 1,
               LU process on subsystem k of A^{(i)}
          for each k = 1, ..., N_{i+1},
               UL process on subsystem k of A^{(i)}
          generate A^{(i+1)}, w^{(i+1)}
     solve b_1^{(m)} x_1^{(m)} = w_1^{(m)}
     for i = m-1, ..., 1
          for each k = 1, ..., N_{i+1} + 1
               back substitute for the p - 1 components x_j^{(i)} in group k
Timing:

1. (a) for general n, {(5n^2 + n)τ_D + (1/3)(26n^3 + 3n^2 - 5n)τ_M
        + (1/3)(26n^3 - 3n^2 - 5n)τ_S}(N - p + 1)
       + {(5n^2 + n)σ_D + (1/3)(26n^3 + 3n^2 - 5n)σ_M
        + (1/3)(26n^3 - 3n^2 - 5n)σ_S}(m-1)(p-1)
       + {(1/2)(n^2 + n)t_D + (1/6)(2n^3 + 3n^2 - 5n)(t_M + t_S)}

   (b) with STAR data, (1/2)(26n^3 + 21n^2 - n)(N - p + 1)
       + (1/3)(5980n^3 + 2769n^2 - 649n)(m-1)(p-1)
       + (10n^3 + 38n^2 - 2n)

   (c) using STAR data, n = 2: 145N + (19206m - 19351)(p-1) + 228
       = 145N + 19206m(p-1) - (19351p - 19579)
2. data manipulation costs are outlined in Section 10.B. We have not
   derived the complete timing formulae, as the method is expected to be
   too slow.

Notes:

1. In the arithmetic timing, for p ≥ 3, p-fold reduction is always cheaper
   than (p+1)-fold reduction for any n. Comparison of 2-fold (odd-even)
   and 3-fold reduction yields an N(n) such that if N ≥ N(n) then odd-even
   reduction is faster. Although we have not determined N(n), the block
   LU factorization will probably be faster than either reduction method
   for N < N(n). Similar results should hold when overhead costs are
   included in the time estimates.
5. Mixed odd-even reduction and LU factorization

Algorithm: do k odd-even reductions, solve A^{(k+1)} x^{(k+1)} = w^{(k+1)} by the
LU factorization, and back substitute.

Timing: (based on N = 2^m - 1)

1. (a) {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M
        + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N + 1 - 2^{m-k})
       + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M
        + (1/6)(38n^3 + 3n^2 - 5n)σ_S}k
       + {(1/2)(3n^2 + n)t_D + (1/6)(14n^3 + 9n^2 - 5n)(t_M + t_S)}(2^{m-k} - 1)
       - {n^2 t_D + (2n^3 + n^2)(t_M + t_S)}

The timing difference between odd-even reduction and the mixed algo-
rithm will be an expression of the form γ_n(m-k) - β_n(2^{m-k}) - ν_n, and k
should be chosen to maximize this expression. For n = 2 using STAR
data, we have γ_2 = 14839, β_2 = 905.5, ν_2 = 13023, k = m-5, yielding

   (c) 106.5N + 14839m - 46908.5

2. For n = 2 and STAR data, the polyalgorithm time is

       183N + 16233k + 417(2^{m-k}) + 33,

   and the optimal choice is k = m-6, yielding

   (b) 183N + 16233m - 70677
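The optimal k can be checked by direct enumeration of the polyalgorithm time (a small sketch; function names are ours):

```python
def polyalgorithm_time(N, m, k):
    # Second-model time estimate for mixed OER + LU with n = 2 and
    # STAR data, from the text: 183N + 16233k + 417(2^(m-k)) + 33.
    return 183 * N + 16233 * k + 417 * 2 ** (m - k) + 33

def best_k(N, m):
    # k trades reduction steps (16233 each) against the LU tail 417(2^(m-k)).
    return min(range(1, m + 1), key=lambda k: polyalgorithm_time(N, m, k))
```

For N = 2^m - 1 this gives k = m - 6 (e.g. k = 4 at N = 1023, k = 7 at N = 8191), as stated.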
6. Incomplete reduction

Algorithm: do k odd-even reductions, solve D[A^{(k+1)}] y^{(k+1)} = w^{(k+1)}, and
back substitute to obtain y = y^{(1)} ≈ x^{(1)} = x.

Timing: (based on N = 2^m - 1)

1. (a) {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M
        + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N + 1 - 2^{m-k})
       + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M
        + (1/6)(38n^3 + 3n^2 - 5n)σ_S}k
       + {(1/2)(n^2 + n)τ_D + (1/6)(2n^3 + 3n^2 - 5n)(τ_M + τ_S)}(2^{m-k} - 1)
       + {(1/2)(n^2 + n)σ_D + (1/6)(2n^3 + 3n^2 - 5n)(σ_M + σ_S)}
   (b) general n, STAR data:
       (1/4)(38n^3 + 31n^2 - n)N + (1/6)(8740n^3 + 5103n^2 - 649n)k
       - (9n^3 + 6n^2)(2^{m-k} - 1) + (1/6)(460n^3 + 1191n^2 - 649n)

   (c) using STAR data, n = 2: 106.5N + 14839k - 96(2^{m-k}) + 1287

2. (a) general n, STAR data:
       (1/4)(38n^3 + 86n^2 + 42n)N + (1/6)(8740n^3 + 5103n^2 + 443n + 5700)k
       - (1/4)(36n^3 + 79n^2 + 51n)(2^{m-k} - 1)
       + (1/6)(460n^3 + 1191n^2 - 1651n) + 80(k+1)

   (b) n = 2, STAR data:
       183N + 16233k - 176.5(2^{m-k}) + 1113.5
7. Jacobi iteration, semi-iteration

basic step: y^{(i+1)} = B[A]y^{(i)} + D[A]^{-1}v

     for each j = 1, ..., N
          solve b_j y_j^{(i+1)} = v_j - a_j y_{j-1}^{(i)} - c_j y_{j+1}^{(i)}

Timing (identical for both vector computer models):

1. (a) (1/2)(n^2 + n)(τ_D N + σ_D)
       + (1/6)(2n^3 + 15n^2 - 5n)((τ_M + τ_S)N + (σ_M + σ_S))

1. (b) = 2. (a) using STAR data,
       (1/4)(2n^3 + 19n^2 - n)N + (1/6)(460n^3 + 3951n^2 - 649n)

1. (c) = 2. (b) using STAR data, n = 2: 22.5N + 3031
Optional preprocessing, times for n = 2, STAR data:

(1) solve b_j(α_j γ_j φ_j) = (a_j c_j v_j), cost = 38.5N + 4367. The basic
    step is then
         y_j^{(i+1)} = φ_j - α_j y_{j-1}^{(i)} - γ_j y_{j+1}^{(i)},
    cost = 12N + 1840.

(2) factor b_j, cost = 3.5N + 367. The basic step is unchanged from the
    original, but cost is 19N + 2634. This is used in the mixed
    incomplete reduction/Jacobi algorithm.
Semi-iteration, basic step:

     y^{(m)} = result of Jacobi step from y^{(m-1)}
     y^{(m+1)} = ω_{m+1}(y^{(m)} - y^{(m-1)}) + y^{(m-1)},
     ω_{m+1} = 1/(1 - (1/4)ρ^2 ω_m),   ρ = ρ(B[A]).

Timing, n = 2, using STAR data: 26.5N + 3408, or 16N + 2217 with the first
preprocessing.
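The basic step and its acceleration can be sketched as follows (Python with NumPy; as in the text, ρ = ρ(B[A]) is assumed known, and ω_1 = 1, ω_2 = 2/(2 - ρ^2)):

```python
import numpy as np

def jacobi_step(a, b, c, v, y):
    # One block Jacobi step: solve b_j y'_j = v_j - a_j y_{j-1} - c_j y_{j+1}.
    N = len(b)
    out = []
    for j in range(N):
        rhs = v[j].copy()
        if j > 0:
            rhs = rhs - a[j] @ y[j - 1]
        if j < N - 1:
            rhs = rhs - c[j] @ y[j + 1]
        out.append(np.linalg.solve(b[j], rhs))
    return out

def jacobi_semi_iteration(a, b, c, v, rho, iters):
    # Chebyshev acceleration of the Jacobi step; omega_1 = 1,
    # omega_2 = 2/(2 - rho^2), omega_{m+1} = 1/(1 - rho^2 omega_m / 4).
    y_old = [np.zeros_like(vj) for vj in v]
    y = jacobi_step(a, b, c, v, y_old)          # y^(1), omega_1 = 1
    omega = 2.0 / (2.0 - rho * rho)             # omega_2
    for _ in range(iters):
        yj = jacobi_step(a, b, c, v, y)
        y_new = [omega * (yj[j] - y_old[j]) + y_old[j] for j in range(len(v))]
        y_old, y = y, y_new
        omega = 1.0 / (1.0 - 0.25 * rho * rho * omega)
    return y
```

In practice the preprocessing options above would be applied so that each step costs one multiply-and-subtract sweep plus a back solve with the stored factors of the b_j.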
Notes:

1. Varga [V3, p. 139] shows that for Jacobi semi-iteration ω_m → ω_∞ =
   2/(1 + √(1 - ρ^2)), and the asymptotic convergence factor is ω_∞ - 1.
   A conservative estimate for the number of iterations needed to obtain
   ||e^{(n)}|| ≤ (1/N) ||e^{(0)}|| is n = 2 log N (-log(ω_∞ - 1))^{-1}.
   It follows that the execution time for an iterative solution is
   O(N log N) vs. O(N) for the better direct methods.
8. Incomplete reduction plus Jacobi iterations

Algorithm: do k odd-even reductions, solve D[A^{(k+1)}]y^{(k+1)} = w^{(k+1)}, do
ℓ Jacobi iterations to improve y^{(k+1)}, and back substitute.

As a result of solving D[A^{(k+1)}]y^{(k+1)} = w^{(k+1)}, we have factors for
D[A^{(k+1)}]. This means the cost is (using STAR data, n = 2)

1. (c) 106.5N + 14839k - 96(2^{m-k}) + 1287 + ℓ(19(2^{m-k}) + 2615).

2. (b) 183N + 16233k - 176.5(2^{m-k}) + 1113.5 + ℓ(19(2^{m-k}) + 2615).
REFERENCES

[B1]  I. Babuska, "Numerical stability in mathematical analysis," IFIP
      Congress 1968, North-Holland, Amsterdam, pp. 11-23.

[B2]  I. Babuska, Numerical stability in the solution of tridiagonal
      matrices, Rept. BN-609, Inst. for Fluid Dynamics and Appl. Math.,
      Univ. of Md., 1969.

[B3]  I. Babuska, "Numerical stability in problems of linear algebra,"
      SIAM J. Numer. Anal., vol. 9, 1972, pp. 53-77.

[B4]  D. W. Bailey and D. E. Crabtree, "Bounds for determinants," Lin.
      Alg. Appl., vol. 2, 1969, pp. 303-309.

[B5]  R. E. Bank, "Marching algorithms and block Gaussian elimination,"
      in [B16, pp. 293-307].

[B6]  G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick,
      R. A. Stokes, "The Illiac IV computer," IEEE Trans. on Comp.,
      vol. C-17, 1968, pp. 746-757.

[B7]  G. M. Baudet, Asynchronous iterative methods for multiprocessors,
      Dept. of Computer Science, Carnegie-Mellon Univ., Nov. 1976.

[B8]  F. L. Bauer, "Der Newton-Prozess als quadratisch konvergente
      Abkürzung des allgemeinen linearen stationären Iterationsverfahrens
      1. Ordnung (Wittmeyer-Prozess)," Z. Angew. Math. Mech., vol. 35,
      1955, pp. 469-470.

[B9]  A. Benson and D. J. Evans, "The successive peripheral block over-
      relaxation method," J. Inst. Maths. Applics., vol. 9, 1972,
      pp. 68-79.

[B10] A. Benson and D. J. Evans, "Successive peripheral overrelaxation
      and other block methods," J. Comp. Phys., vol. 21, 1976, pp. 1-19.

[B11] G. Birkhoff and A. George, "Elimination by nested dissection,"
      in [T1, pp. 221-269].

[B12] W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall,
      A. H. Sameh, D. L. Slotnick, "The Illiac IV system," Proc. IEEE,
      vol. 60, 1972, pp. 369-388.

[B13] A. Brandt, Multi-level adaptive techniques I. The multi-grid
      method, Rept. RC 6026, IBM T. J. Watson Res. Cen., Yorktown
      Heights, N. Y., 1976.

[B14] J. L. Brenner, "A bound for a determinant with dominant main
      diagonal," Proc. AMS, vol. 5, 1954, pp. 631-634.
[B15] C. G. Broyden, "Some condition number bounds for the Gaussian
      elimination process," J. Inst. Maths. Applics., vol. 12, 1973,
      pp. 273-286.

[B16] J. R. Bunch and D. J. Rose, eds., Sparse Matrix Computations,
      Academic Press, N. Y., 1976.

[B17] B. L. Buzbee, Application of fast Poisson solvers to the numerical
      approximation of parabolic problems, Rept. LA-4950-T, Los Alamos
      Sci. Lab., 1972.

[B18] B. L. Buzbee and A. Carasso, "On the numerical computation of para-
      bolic problems for preceding times," Math. Comp., vol. 27, 1973,
      pp. 237-266.

[B19] B. L. Buzbee, F. W. Dorr, J. A. George and G. H. Golub, "The
      direct solution of the discrete Poisson equation on irregular
      regions," SIAM J. Numer. Anal., vol. 8, 1971, pp. 722-736.

[B20] B. L. Buzbee, G. H. Golub and C. W. Nielson, "On direct methods
      for solving Poisson's equations," SIAM J. Numer. Anal., vol. 7,
      1970, pp. 627-656.

[C1]  A. Carasso, "The abstract backward beam equation," SIAM J. Math.
      Anal., vol. 2, 1971, pp. 193-212.

[C2]  Control Data Corporation, CDC STAR-100 instruction execution times,
      Arden Hills, Minn., Jan. 1976; in conjunction with measurements
      made at Lawrence Livermore Laboratory.

[C3]  A. K. Cline, personal communication, 1974.

[C4]  P. Concus and G. H. Golub, "Use of fast direct methods for the effi-
      cient solution of nonseparable elliptic equations," SIAM J. Numer.
      Anal., vol. 10, 1973, pp. 1103-1120.

[C5]  P. Concus, G. H. Golub and D. P. O'Leary, "A generalized conjugate
      gradient method for the numerical solution of elliptic partial dif-
      ferential equations," in [B16, pp. 309-332].

[C6]  S. D. Conte and C. de Boor, Elementary Numerical Analysis, McGraw-
      Hill, N. Y., 1972.

[C7]  M. G. Cox, "Curve fitting with piecewise polynomials," J. Inst.
      Maths. Applics., vol. 8, 1971, pp. 36-52.

[C8]  Cray Research, Inc., "Cray-1 Computer," Chippewa Falls, Wis., 1975.

[D1]  M. A. Diamond and D. L. V. Ferreira, "On a cyclic reduction method
      for the solution of Poisson's equations," SIAM J. Numer. Anal.,
      vol. 13, 1976, pp. 54-70.

[D2]  F. W. Dorr, "The direct solution of the discrete Poisson equation
      on a rectangle," SIAM Review, vol. 12, 1970, pp. 248-263.
[D3] F. W. Dorr, "An example of ill-conditioning in the numerical solu-
tion of singular perturbation problems," Math. Comp., vol. 25.
1971, pp. 271-283.
[El] S. C. Eisenstat, M. H. Schultz and A. H. Sherman, "Applications of
an element model for Gaussian elimination," in [B16, pp. 85-96].
[E2] D. J. Evans, "A new iterative method for the solution of sparse
systems of linear difference equations," in [R6, pp. 89-100],
[E3] R. K. Even and Y. Wallach, "On the direct solution of Dirichlet's
problem in two dimensions," Computing , vol. 5, 1970, pp. 45-56.
[FI] D. C. Feingold and R. S. Varga, "Block diagonally dominant matrices
and generalizations of the Gerschgorin circle theorem," Pacific J.
Math. , vol. 12, 1962, pp. 1241- 1250.
[F2] D. Fischer, G. Golub, 0. Hald, J. Leiva and 0. Widlund, "On
Fourier-Toeplitz methods for separable elliptic problems," Math ■
Comp . , vol. 28, 1974, pp. 349-368.
[F3] G. von Fuchs, J. R. Roy and E. Schrem, "Hypermatrix solution of
large sets of symmetric positive definite linear equations," Comp .
Math, in Aopl. Mech. and Engng., vol. 1, 1972, pp. 197-216.
[Gl] J. A. George, "Nested dissection of a regular finite element mesh,"
SIAM J. Numer. Anal., vol. 10, 1973, pp. 345-363.
[HI] L. A. Hageman and R. S. Varga, "Block iterative methods for cyclic-
ally reduced matrix equations," Numer. Math., vol. 6, 1964, pp. 106-
119.
[H2] D. E. Heller, "Some aspects of the cyclic reduction algorithm for
block tridiagonal linear systems," SIAM J. Numer. Anal., vol. 13,
1976, pp. 484-496.
[H3] D. E. Heller, A survey of parallel algorithms in numerical linear
algebra, Dept. of Comp. Sci., Carnegie-Mellon Univ., 1976; SIAM
Review, to appear.
[H4] D. E. Heller, D. K. Stevenson and J. F. Traub, "Accelerated itera-
tive methods for the solution of tridiagonal systems on parallel
computers," J.ACM, vol. 23, 1976, pp. 636-654.
[H5] R. W. Hockney, "A fast direct solution of Poisson's equation using
Fourier analysis," J.ACM, vol. 12, 1965, pp. 95-113.
[H6] R. W. Hockney, "The potential calculation and some applications,"
Methods in Computational Physics, vol. 9, 1970, pp. 135-211.
[H7] E. Horowitz, "The application of symbolic mathematics to a singular
perturbation problem," Proc. ACM Annual Conf., 1972, vol. II,
pp. 816-825.
[H8] A. S. Householder, The Theory of Matrices in Numerical Analysis,
Blaisdell, N. Y., 1964.
[J1] T. L. Jordan, A new parallel algorithm for diagonally dominant
tridiagonal matrices, Los Alamos Sci. Lab., 1974.
[J2] M. L. Juncosa and T. W. Mullikin, "On the increase of convergence
rates of relaxation procedures for elliptic partial difference
equations," J.ACM, vol. 7, 1960, pp. 29-36.
[K1] W. Kahan, Conserving confluence curbs ill-condition, Dept. of
Comp. Sci., Univ. of Calif., Berkeley, 1972.
[K2] H. T. Kung, "On computing reciprocals of power series," Numer.
Math., vol. 22, 1974, pp. 341-348.
[K3] H. T. Kung and J. F. Traub, All algebraic functions can be computed
fast, Dept. of Comp. Sci., Carnegie-Mellon Univ., July 1976.
[L1] J. J. Lambiotte, Jr., The solution of linear systems of equations
on a vector computer, Dissertation, Univ. of Virginia, 1975.
[L2] J. J. Lambiotte, Jr. and R. G. Voigt, "The solution of tridiagonal
linear systems on the CDC STAR-100 computer," ACM Trans. Math.
Software, vol. 1, 1975, pp. 308-329.
[M1] N. Madsen and G. Rodrigue, A comparison of direct methods for tri-
diagonal systems on the CDC STAR-100, Lawrence Livermore Lab.,
1975.
[M2] M. A. Malcolm and J. Palmer, "A fast direct method for solving a
class of tridiagonal linear systems," CACM, vol. 17, 1974, pp. 14-
17.
[M3] W. Miller, "Software for roundoff analysis," ACM Trans. Math.
Software, vol. 1, 1975, pp. 108-128.
[N1] A. K. Noor and S. J. Voigt, "Hypermatrix scheme for finite element
systems on CDC STAR-100 computer," Computers and Structures, vol.
5, 1975, pp. 287-296.
[O1] J. M. Ortega and R. G. Voigt, Solution of partial differential
equations on vector computers, ICASE, Hampton, Va., 1977.
[O2] A. M. Ostrowski, "Über die Determinanten mit überwiegender Haupt-
diagonale," Comment. Math. Helv., vol. 10, 1937, pp. 69-96.
[P1] G. Peters and J. H. Wilkinson, "On the stability of Gauss-Jordan
elimination with pivoting," CACM, vol. 18, 1975, pp. 20-24.
[R1] J. K. Reid, ed., Large Sparse Sets of Linear Equations, Academic
Press, London, 1970.
[R2] J. K. Reid, "On the method of conjugate gradients for the solu-
tion of large sparse systems of linear equations," in [R1,
pp. 231-254].
[R3] F. Robert, "Blocs-H-matrices et convergence des méthodes itéra-
tives classiques par blocs," Lin. Alg. Appl., vol. 2, 1969,
pp. 223-265.
[R4] D. J. Rose and J. R. Bunch, "The role of partitioning in the numer-
ical solution of sparse systems," in [R6, pp. 177-187].
[R5] D. J. Rose and G. F. Whitten, "A recursive analysis of dissection
strategies," in [B16, pp. 59-83].
[R6] D. J. Rose and R. A. Willoughby, eds., Sparse Matrices and Their
Applications, Plenum Press, N. Y., 1972.
[R7] J. B. Rosser, The direct solution of difference analogs of
Poisson's equation, Rept. MRC-TSR-797, Math. Res. Cen., Madison,
Wis., 1967.
[S1] J. Schröder and U. Trottenberg, "Reduktionsverfahren für Differ-
enzengleichungen bei Randwertaufgaben I," Numer. Math., vol. 22,
1973, pp. 37-68.
[S1a] J. Schröder, U. Trottenberg and H. Reutersberg, "Reduktionsver-
fahren für Differenzengleichungen bei Randwertaufgaben II," Numer.
Math., vol. 26, 1976, pp. 429-459.
[S2] H. S. Stone, "Parallel tridiagonal equation solvers," ACM Trans.
Math. Software, vol. 1, 1975, pp. 289-307.
[S3] H. S. Stone, ed., Introduction to Computer Architecture, Science
Research Associates, Palo Alto, Calif., 1975.
[S4] G. Strang and G. J. Fix, An Analysis of the Finite Element Method,
Prentice-Hall, Englewood Cliffs, N. J., 1973.
[S5] P. N. Swarztrauber, "The direct solution of the discrete Poisson
equation on the surface of a sphere," J. Comp. Phys., vol. 15,
1974, pp. 46-54.
[S6] P. N. Swarztrauber, "A direct method for the discrete solution of
separable elliptic equations," SIAM J. Numer. Anal., vol. 11,
1974, pp. 1136-1150.
[S7] P. N. Swarztrauber and R. A. Sweet, "The direct solution of the
discrete Poisson equation on a disc," SIAM J. Numer. Anal., vol.
10, 1973, pp. 900-907.
[S8] P. Swarztrauber and R. Sweet, Efficient FORTRAN subprograms for
the solution of elliptic partial differential equations, Rept.
NCAR-TN/IA-109, Nat. Cen. for Atmospheric Res., Boulder, Col.,
1975.
[S9] R. A. Sweet, "Direct methods for the solution of Poisson's equa-
tion on a staggered grid," J. Comp. Phys., vol. 12, 1973, pp. 422-
428.
[S10] R. A. Sweet, "A generalized cyclic reduction algorithm," SIAM J.
Numer. Anal., vol. 11, 1974, pp. 506-520.
[S11] R. A. Sweet, "A cyclic reduction algorithm for block tridiagonal
systems of arbitrary dimension," contributed paper, Symp. on
Sparse Matrix Computations, Argonne Nat. Lab., Sept. 1975.
[T1] J. F. Traub, ed., Complexity of Sequential and Parallel Numerical
Algorithms, Academic Press, N. Y., 1973.
[T2] J. F. Traub, "Iterative solution of tridiagonal systems on parallel
or vector computers," in [T1, pp. 49-82].
[T3] J. F. Traub, ed., Analytic Computational Complexity, Academic
Press, N. Y., 1976.
[V1] J. M. Varah, "On the solution of block tridiagonal systems arising
from certain finite difference equations," Math. Comp., vol. 26,
1972, pp. 359-368.
[V2] J. M. Varah, "A comparison of some numerical methods for two-point
boundary value problems," Math. Comp., vol. 28, 1974, pp. 743-755.
[V3] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood
Cliffs, N. J., 1962.
[V4] R. S. Varga, "On recurring theorems on diagonal dominance," Lin.
Alg. Appl., vol. 13, 1976, pp. 1-9.
[W1] W. J. Watson, "The TI ASC, a highly modular and flexible super-
computer architecture," AFIPS Fall 1972, AFIPS Press, Montvale,
N. J., vol. 41, pt. 1, pp. 221-229.
[W2] O. B. Widlund, "On the use of fast methods for separable finite
difference equations for the solution of general elliptic problems,"
in [R6, pp. 121-131].
[W3] J. H. Wilkinson, "Error analysis of direct methods of matrix inver-
sion," J.ACM, vol. 8, 1961, pp. 281-330.
[W4] W. A. Wulf and C. G. Bell, "C.mmp, a multi-mini-processor," AFIPS
Fall 1972, AFIPS Press, Montvale, N. J., vol. 41, pt. 2, pp. 765-
777.
[Y1] D. Y. Y. Yun, "Hensel meets Newton - algebraic constructions in an
analytic setting," in [T3, pp. 205-215].
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)
REPORT DOCUMENTATION PAGE
4. TITLE (and Subtitle): DIRECT AND ITERATIVE METHODS FOR BLOCK
   TRIDIAGONAL LINEAR SYSTEMS
5. TYPE OF REPORT & PERIOD COVERED: Interim
7. AUTHOR(S): Don Eric Heller
8. CONTRACT OR GRANT NUMBER(S): MCS75-222-55; N00014-76-C-0370;
   NR 044-422
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Computer Science Dept.,
   Carnegie-Mellon University
11. CONTROLLING OFFICE NAME AND ADDRESS: Office of Naval Research,
    Arlington, VA 22217
12. REPORT DATE: April 1977
13. NUMBER OF PAGES: 169
15. SECURITY CLASS. (of this report): UNCLASSIFIED
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release;
    distribution unlimited.